Get minimum of minimums per column - pandas - Python

I have the following data frame:
index timestamp Output_Energy Elevation one two three
.
.
538 2016-06-20 08:58:00+05:30 40.34829924338887 44.04129964199598 0.0 0.0 0.0
539 2016-06-20 08:59:00+05:30 40.5703298574816 44.25962644894399 0.0 0.0 0.0
540 2016-06-20 09:00:00+05:30 40.79141282764114 44.47799109475774 25.0 30.0 40.0
541 2016-06-20 09:01:00+05:30 41.01157539726741 44.69639316356193 0.0 0.0 0.0
542 2016-06-20 09:02:00+05:30 41.230790026853384 44.91483208973582 0.0 0.0 0.0
543 2016-06-20 09:03:00+05:30 41.44905311289469 45.13330730580419 0.0 0.0 0.0
544 2016-06-20 09:04:00+05:30 41.666364098311895 45.351818241921414 0.0 0.0 0.0
545 2016-06-20 09:05:00+05:30 41.8827166483967 45.57036432534446 0.0 0.0 0.0
546 2016-06-20 09:06:00+05:30 42.0981013074145 45.78894498194398 0.0 0.0 0.0
547 2016-06-20 09:07:00+05:30 42.31250713341641 46.007559633636575 0.0 0.0 0.0
548 2016-06-20 09:08:00+05:30 42.525960667204465 46.22620769987313 0.0 0.0 0.0
549 2016-06-20 09:09:00+05:30 42.738418433471544 46.44488859711554 0.0 0.0 0.0
550 2016-06-20 09:10:00+05:30 42.94990329039521 46.66360188496949 0.0 0.0 0.0
551 2016-06-20 09:11:00+05:30 43.160390522060574 46.88234668098421 0.0 0.0 0.0
552 2016-06-20 09:12:00+05:30 43.36988302062059 47.101122538059016 0.0 0.0 0.0
553 2016-06-20 09:13:00+05:30 43.57837124543777 47.319928859306344 0.0 0.0 0.0
554 2016-06-20 09:14:00+05:30 43.785848829859404 47.53876504413286 0.0 0.0 0.0
555 2016-06-20 09:15:00+05:30 43.992319514156094 47.75763048766155 0.0 0.0 0.0
556 2016-06-20 09:16:00+05:30 44.19779793479642 47.976524582260204 0.0 0.0 0.0
557 2016-06-20 09:17:00+05:30 44.402250365403276 48.19544671486692 0.0 0.0 0.0
558 2016-06-20 09:18:00+05:30 44.6056725858456 48.4143962684937 0.0 0.0 0.0
559 2016-06-20 09:19:00+05:30 44.80807132168398 48.63337262163986 0.0 0.0 0.0
560 2016-06-20 09:20:00+05:30 45.00943825682395 48.852375147711754 0.0 0.0 0.0
.
.
I am trying to find the single minimum value across the columns one, two, and three with the following:
i = 0
for i in range(0, len(df_temp) - 1):
    target_batteries = ["one", "two", "three"]
    global_min_SOC = df_temp[target_batteries].min().min()
    min_SOC_battery = (df_temp[target_batteries].min() == global_min_SOC)
    min_SOC = df_temp[target_batteries].idxmin(axis=1)
    i += 1
min_SOC_battery
Somehow I am getting global_min_SOC = 0.0, min_SOC_battery as
one True
two True
three True
dtype: bool
and min_SOC as
0 one
1 one
2 one
3 one
4 one
...
2876 one
2877 one
2878 one
2879 one
2880 one
Length: 2881, dtype: object
Expected output is global_min_SOC = 25.0, min_SOC = one, and min_SOC_battery as
one True
two False
three False
dtype: bool
What am I doing wrong?
In addition, how can I access the value of min_SOC_battery? I want to call a function on it.
Thanks in advance!
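One way to get the expected output, as a sketch, assuming the 0.0 readings are placeholders that should not count as minima (which is why global_min_SOC comes back as 0.0). No loop is needed:
import numpy as np

target_batteries = ["one", "two", "three"]

# assumption: treat 0.0 as "no reading" so it cannot win the minimum
nonzero = df_temp[target_batteries].replace(0.0, np.nan)

global_min_SOC = nonzero.min().min()               # 25.0
min_SOC_battery = nonzero.min() == global_min_SOC  # one True, two False, three False
min_SOC = min_SOC_battery.idxmax()                 # 'one'
To read a single entry of the boolean min_SOC_battery Series (for example, to pass it to a function), index it by label: min_SOC_battery['one'] returns True.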

Related

Sum values in df1 based on .eq() in df2

I want to sum up some market volumes based on equal prices in, let's say, 6 hours of 2017.
I have a DataFrame, df1 (market_volumes), that contains the market volumes in some areas. Then I have another DataFrame, df2 (mFRR_price), which contains some market prices.
df1
Date NO1 Up NO1 Down NO2 Up ... DK1 Up DK1 Down DK2 Up DK2 Down
35062 31-12-2020 54.0 0.0 214.0 ... 33.0 0.0 31.0 0.0
35063 31-12-2020 3.0 0.0 121.0 ... 125.0 0.0 21.0 0.0
35064 31-12-2020 0.0 -28.0 0.0 ... 0.0 -9.0 0.0 0.0
35065 31-12-2020 0.0 -83.0 0.0 ... 0.0 0.0 0.0 0.0
35066 31-12-2020 0.0 -80.0 0.0 ... 0.0 -55.0 0.0 0.0
35067 31-12-2020 0.0 -42.0 0.0 ... 79.0 0.0 23.0 0.0
df2
Date NO1 Up NO2 Up NO3 Up ... SE4 Up FI Up DK1 Up DK2 Up
35062 31-12-2020 47.4 47.4 27.2 ... 61.1 61.1 94.1 94.1
35063 31-12-2020 31.0 31.0 25.7 ... 58.0 35.3 89.4 89.4
35064 31-12-2020 24.8 24.8 24.8 ... 54.5 24.8 56.7 56.7
35065 31-12-2020 24.8 24.8 24.8 ... 51.2 28.0 52.4 52.4
35066 31-12-2020 24.6 24.6 24.6 ... 45.8 26.6 51.9 51.9
35067 31-12-2020 24.1 24.1 23.3 ... 24.1 24.1 78.7 78.7
Now, I want to sum up the market volumes from df1 IF the values in a row in df2 are equal to the value in column "NO1 Up".
i.e., I am looking for a way to end up with a new DataFrame that would result in:
df3
Date NO1 Up NO1 Down NO2 Up ... DK1 Up DK1 Down DK2 Up DK2 Down SUM
35062 31-12-2020 54.0 0.0 214.0 ... 33.0 0.0 31.0 0.0 (54+214)
35063 31-12-2020 3.0 0.0 121.0 ... 125.0 0.0 21.0 0.0 (3+121)
35064 31-12-2020 0.0 -28.0 0.0 ... 0.0 -9.0 0.0 0.0 etc.
35065 31-12-2020 0.0 -83.0 0.0 ... 0.0 0.0 0.0 0.0
35066 31-12-2020 0.0 -80.0 0.0 ... 0.0 -55.0 0.0 0.0
35067 31-12-2020 0.0 -42.0 0.0 ... 79.0 0.0 23.0 0.0
... because it locates the area prices that are equal and sums the market volumes on those locations in the DataFrame.
I've been working on this:
market_volumes['sum'] = mFRR_price.eq(mFRR_price['NO1 Up'], axis=0).mul(mFRR_price['NO1 Up'], axis=0).sum(axis=1)
But it sums the values in df2 and puts them in df1. I need the POSITIONS from df2, but the values from df1.
import pandas as pd
df3['SUM'] = df3['NO1 Up'] + df3['NO2 Up']
You can use .loc and apply boolean indexing.
df1.loc[df2['NO1 Up'] == df2['NO2 Up'], 'SUM'] = df1['NO1 Up'] + df1['NO2 Up']
df1.loc[df2['NO1 Up'] != df2['NO2 Up'], 'SUM'] = 0
The first line goes down df2's index and checks whether the values in columns NO1 Up and NO2 Up are equal. It then creates a column called 'SUM'; the value of this new column depends on the outcome of that boolean. If the boolean is true, pandas goes to the SUM column and performs the operation below:
= df1['NO1 Up'] + df1['NO2 Up']
Conversely, if the outcome is false, pandas will insert NaN into your SUM column.
Not sure if you are OK with NaN values. Most are not, so the second line of code is more or less the inverse of the first: if df2['NO1 Up'] != df2['NO2 Up'], insert integer 0 in the df1 SUM column.
Again, there are probably other ways to accomplish what you want.
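For the general case of comparing every price column against NO1 Up (not just NO2 Up), a sketch along these lines may work, assuming df1 and df2 share the same index and every 'Up' column in df2 has a same-named volume column in df1:
# price columns to compare against the NO1 Up price
up_cols = [c for c in df2.columns if c.endswith('Up')]

# True where an area's price equals the NO1 Up price in that hour
mask = df2[up_cols].eq(df2['NO1 Up'], axis=0)

# keep df1's volumes only at those positions, then sum across each row
df1['SUM'] = df1[up_cols].where(mask, 0).sum(axis=1)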

faster way to calculate a rolling sum in a dataframe

To calculate a volume weighted moving average (VWMA) I am collecting a sum(price*volume) and dividing it by the sum(volume).
I need a faster way to get a value from the previous row and add it to a value on the current row.
I have the following dataframe:
import pandas as pd
from itertools import repeat

df = pd.DataFrame({'dtime': ['16:00', '15:00', '14:00', '13:00', '12:00', '11:00', '10:00', '09:00', '08:00', '07:00', '06:00', '05:00', '04:00', '03:00', '02:00', '01:00'],
                   'time': [1800, 1740, 1680, 1620, 1560, 1500, 1440, 1380, 1320, 1260, 1200, 1140, 1080, 1020, 960, 900],
                   'price': [100.1, 102.7, 108.5, 105.3, 107.1, 103.4, 101.8, 102.7, 101.6, 99.8, 100.2, 97.7, 99.3, 100.1, 102.5, 103.9],
                   'volume': [6.0, 6.5, 5.4, 6.3, 6.4, 7.1, 6.7, 6.2, 5.7, 1.2, 2.4, 3.9, 5.2, 8.9, 7.2, 6.5]
                  }, columns=['dtime', 'time', 'price', 'volume']).set_index('dtime')

df.insert(df.shape[1], "PV", df['price'] * df['volume'])
df.insert(df.shape[1], "flag", list(repeat(0.0, len(df))))
df.insert(df.shape[1], "PVsum_2", list(repeat(0.0, len(df))))
df.insert(df.shape[1], "Vsum_2", list(repeat(0.0, len(df))))
df.insert(df.shape[1], "VWMA_2", list(repeat(0.0, len(df))))
Which is
df =
time price volume PV flag PVsum_2 Vsum_2 VWMA_2
dtime
16:00 1800 100.1 6.0 600.60 0.0 0.0 0.0 0.0
15:00 1740 102.7 6.5 667.55 0.0 0.0 0.0 0.0
14:00 1680 108.5 5.4 585.90 0.0 0.0 0.0 0.0
13:00 1620 105.3 6.3 663.39 0.0 0.0 0.0 0.0
12:00 1560 107.1 6.4 685.44 0.0 0.0 0.0 0.0
11:00 1500 103.4 7.1 734.14 0.0 0.0 0.0 0.0
10:00 1440 101.8 6.7 682.06 0.0 0.0 0.0 0.0
09:00 1380 102.7 6.2 636.74 0.0 0.0 0.0 0.0
08:00 1320 101.6 5.7 579.12 0.0 0.0 0.0 0.0
07:00 1260 99.8 1.2 119.76 0.0 0.0 0.0 0.0
06:00 1200 100.2 2.4 240.48 0.0 0.0 0.0 0.0
05:00 1140 97.7 3.9 381.03 0.0 0.0 0.0 0.0
04:00 1080 99.3 5.2 516.36 0.0 0.0 0.0 0.0
03:00 1020 100.1 8.9 890.89 0.0 0.0 0.0 0.0
02:00 960 102.5 7.2 738.00 0.0 0.0 0.0 0.0
01:00 900 103.9 6.5 675.35 0.0 0.0 0.0 0.0
Right now I am using a for loop to check, for each row, whether 'flag' is set.
#---- pseudo code ----
# for each row in df (from bottom to top, excluding the very bottom row):
#     if flag[row] is not set:
#         PVsum_2[row] = PV[row] + PV[row + 1]
#         Vsum_2[row]  = volume[row] + volume[row + 1]
#         VWMA_2[row]  = PVsum_2[row] / Vsum_2[row]
#         flag[row]    = 1.0
#---- pseudo code ----
my_dict = {'dtime'   : 0,
           'time'    : 1,
           'price'   : 2,
           'volume'  : 3,
           'PV'      : 4,
           'check'   : 5,
           'PVsum_2' : 6,
           'Vsum_2'  : 7,
           'VWMA_2'  : 8}

for row in reversed(range(len(df) - 1)):
    # if flag value is not set (i.e. flag == 0)
    if not df['flag'][row]:
        # sum of current and previous PV (price*volume) values
        a = df['PV'][row] + df['PV'][row + 1]
        df.iloc[row, my_dict['PVsum_2'] - 1] = a
        # sum of current and previous volumes
        b = df['volume'][row] + df['volume'][row + 1]
        df.iloc[row, my_dict['Vsum_2'] - 1] = b
        # PVsum_2 / Vsum_2
        c = (a / b) if b != 0.0 else 0.0
        df.iloc[row, my_dict['VWMA_2'] - 1] = c
        # set check value to 1.0
        df.iloc[row, my_dict['flag'] - 1] = 1.0
but this takes too long on large sets of data (500+ rows)
I'm looking for something faster and more elegant.
The dataframe should look like this when it is done (notice the bottom row has not been altered):
df =
time price volume PV flag PVsum_2 Vsum_2 VWMA_2
dtime
16:00 1800 100.1 6.0 600.60 1.0 1268.15 12.5 101.452000
15:00 1740 102.7 6.5 667.55 1.0 1253.45 11.9 105.331933
14:00 1680 108.5 5.4 585.90 1.0 1249.29 11.7 106.776923
13:00 1620 105.3 6.3 663.39 1.0 1348.83 12.7 106.207087
12:00 1560 107.1 6.4 685.44 1.0 1419.58 13.5 105.154074
11:00 1500 103.4 7.1 734.14 1.0 1416.20 13.8 102.623188
10:00 1440 101.8 6.7 682.06 1.0 1318.80 12.9 102.232558
09:00 1380 102.7 6.2 636.74 1.0 1215.86 11.9 102.173109
08:00 1320 101.6 5.7 579.12 1.0 698.88 6.9 101.286957
07:00 1260 99.8 1.2 119.76 1.0 360.24 3.6 100.066667
06:00 1200 100.2 2.4 240.48 1.0 621.51 6.3 98.652381
05:00 1140 97.7 3.9 381.03 1.0 897.39 9.1 98.614286
04:00 1080 99.3 5.2 516.36 1.0 1407.25 14.1 99.804965
03:00 1020 100.1 8.9 890.89 1.0 1628.89 16.1 101.173292
02:00 960 102.5 7.2 738.00 1.0 1413.35 13.7 103.164234
01:00 900 103.9 6.5 675.35 0.0 0.00 0.0 0.000000
Eventually new data will be added to the top of the data frame as seen below, and will need to be updated again.
df =
time price volume PV flag PVsum_2 Vsum_2 VWMA_2
dtime
19:00 1980 100.1 6.0 600.60 0.0 0.0 0.0 0.0
18:00 1920 102.7 6.5 667.55 0.0 0.0 0.0 0.0
17:00 1860 108.5 5.4 585.90 0.0 0.0 0.0 0.0
16:00 1800 100.1 6.0 600.60 1.0 1268.15 12.5 101.452000
15:00 1740 102.7 6.5 667.55 1.0 1253.45 11.9 105.331933
14:00 1680 108.5 5.4 585.90 1.0 1249.29 11.7 106.776923
13:00 1620 105.3 6.3 663.39 1.0 1348.83 12.7 106.207087
12:00 1560 107.1 6.4 685.44 1.0 1419.58 13.5 105.154074
11:00 1500 103.4 7.1 734.14 1.0 1416.20 13.8 102.623188
10:00 1440 101.8 6.7 682.06 1.0 1318.80 12.9 102.232558
09:00 1380 102.7 6.2 636.74 1.0 1215.86 11.9 102.173109
08:00 1320 101.6 5.7 579.12 1.0 698.88 6.9 101.286957
07:00 1260 99.8 1.2 119.76 1.0 360.24 3.6 100.066667
06:00 1200 100.2 2.4 240.48 1.0 621.51 6.3 98.652381
05:00 1140 97.7 3.9 381.03 1.0 897.39 9.1 98.614286
04:00 1080 99.3 5.2 516.36 1.0 1407.25 14.1 99.804965
03:00 1020 100.1 8.9 890.89 1.0 1628.89 16.1 101.173292
02:00 960 102.5 7.2 738.00 1.0 1413.35 13.7 103.164234
01:00 900 103.9 6.5 675.35 0.0 0.00 0.0 0.000000
It looks like you're not using pandas in the right way. I'd recommend taking a quick look at a tutorial.
For starters, the following lines
df.insert(df.shape[1], "flag", list(repeat(0.0,len(df))))
df.insert(df.shape[1], "PVsum_2", list(repeat(0.0,len(df))))
df.insert(df.shape[1], "Vsum_2", list(repeat(0.0,len(df))))
df.insert(df.shape[1], "VWMA_2", list(repeat(0.0,len(df))))
can be much easier written as:
df['flag'] = 0
df['PVsum_2'] = 0
df['Vsum_2'] = 0
df['VWMA_2'] = 0
But it seems you don't even need to initialise those columns at all.
You also don't need the for loop, because you can align two dataframes: your original, and a copy with every row shifted. For example:
df_shift = df.shift(-1)
You can then use normal vectorised calculations to achieve what you want, e.g.:
df['PVsum_2'] = df['PV'] + df_shift['PV']
df['Vsum_2'] = df['volume'] + df_shift['volume']
idx = df['Vsum_2'] != 0  # this is your check that the denominator is non-zero
df.loc[idx, 'VWMA_2'] = df.loc[idx, 'PVsum_2'] / df.loc[idx, 'Vsum_2']  # and now use that index to only calculate VWMA_2 where Vsum_2 is non-zero
Hopefully you get the idea and can make small adjustments to make it work exactly as you want.
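As an alternative sketch: since each row only needs the sum of itself and the row below it, a rolling window of 2 over the reversed frame computes the same columns in one pass, and index alignment puts the results back on the right rows (assuming rows are ordered newest-first, as in the example):
rev = df[::-1]                         # oldest bar first
pvsum = rev['PV'].rolling(2).sum()     # PV[row] + PV[row + 1] in the original order
vsum = rev['volume'].rolling(2).sum()

# the oldest bar has no predecessor, so its rolling sums are NaN; keep them at 0
df['PVsum_2'] = pvsum.fillna(0.0)
df['Vsum_2'] = vsum.fillna(0.0)
df['VWMA_2'] = (pvsum / vsum).fillna(0.0)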

How to scale all data within a dataframe but one column

My data look like:
cycles os1 os2 os3 sm1 sm2 sm3 sm4 sm5 sm6 sm7 sm8 sm9 sm10 sm11 sm12 sm13 sm14 sm15 sm16 sm17 sm18 sm19 sm20 sm21
0 1 -0.0007 -0.0004 100.0 518.67 641.82 1589.70 1400.60 14.62 21.61 554.36 2388.06 9046.19 1.3 47.47 521.66 2388.02 8138.62 8.4195 0.03 392 2388 100.0 39.06 23.4190
1 2 0.0019 -0.0003 100.0 518.67 642.15 1591.82 1403.14 14.62 21.61 553.75 2388.04 9044.07 1.3 47.49 522.28 2388.07 8131.49 8.4318 0.03 392 2388 100.0 39.00 23.4236
2 3 -0.0043 0.0003 100.0 518.67 642.35 1587.99 1404.20 14.62 21.61 554.26 2388.08 9052.94 1.3 47.27 522.42 2388.03 8133.23 8.4178 0.03 390 2388 100.0 38.95 23.3442
3 4 0.0007 0.0000 100.0 518.67 642.35 1582.79 1401.87 14.62 21.61 554.45 2388.11 9049.48 1.3 47.13 522.86 2388.08 8133.83 8.3682 0.03 392 2388 100.0 38.88 23.3739
4 5 -0.0019 -0.0002 100.0 518.67 642.37 1582.85 1406.22 14.62 21.61 554.00 2388.06 9055.15 1.3 47.28 522.19 2388.04 8133.80 8.4294 0.03 393 2388 100.0 38.90 23.4044
and I want to rescale every column of the dataframe except cycles.
I've tried:
# Scale features
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
for col in data[1:].columns:
    data[col] = sc.fit_transform(data[col].values.reshape(-1, 1))
but it scales the whole dataframe either way (data[1:] slices rows, not columns, so .columns still returns every column).
Help would be appreciated.
Thanks!
You can select all columns except the first with DataFrame.iloc: the first : selects all rows, and 1: selects all columns from the second onward. Pass the result to fit_transform:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
data.iloc[:, 1:] = sc.fit_transform(data.iloc[:, 1:])
print (data)
cycles os1 os2 os3 sm1 sm2 sm3 sm4 sm5 \
0 1 0.074613 -1.128152 0.0 0.0 -1.847703 0.732430 -1.349237 0.0
1 2 1.287080 -0.725241 0.0 0.0 -0.276203 1.313986 -0.034171 0.0
2 3 -1.604187 1.692228 0.0 0.0 0.676221 0.263346 0.514636 0.0
3 4 0.727480 0.483494 0.0 0.0 0.676221 -1.163111 -0.691704 0.0
4 5 -0.484987 -0.322329 0.0 0.0 0.771464 -1.146651 1.560476 0.0
sm6 sm7 sm8 sm9 sm10 sm11 sm12 sm13 \
0 0.0 0.765578 -0.422577 -0.822800 0.0 1.050958 -1.607627 -1.209416
1 0.0 -1.617089 -1.267731 -1.339486 0.0 1.198981 -0.005169 0.950255
2 0.0 0.374977 0.422577 0.822312 0.0 -0.429265 0.356676 -0.777482
3 0.0 1.117119 1.690309 -0.020960 0.0 -1.465421 1.493904 1.382189
4 0.0 -0.640586 -0.422577 1.360934 0.0 -0.355254 -0.237784 -0.345547
sm14 sm15 sm16 sm17 sm18 sm19 sm20 sm21
0 1.866394 0.265372 0.0 0.204124 0.0 0.0 1.549015 0.867102
1 -1.140246 0.795254 0.0 0.204124 0.0 0.0 0.637830 1.020631
2 -0.406508 0.192136 0.0 -1.837117 0.0 0.0 -0.121491 -1.629404
3 -0.153495 -1.944623 0.0 0.204124 0.0 0.0 -1.184541 -0.638144
4 -0.166145 0.691862 0.0 1.224745 0.0 0.0 -0.880812 0.379816
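An equivalent selection by column name, as a sketch, assuming cycles is the only column to leave untouched:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
cols = data.columns.drop('cycles')   # every column except 'cycles'
data[cols] = sc.fit_transform(data[cols])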

Can I pass a list of column names into get_dummies() to use as the column label for all possible answers?

(EDITED: I just realised I may be asking a question that cannot be answered, but I am not sure how to delete this question... please ignore, or advise on how I can delete it. I think I need a different way to approach this problem.)
******----------------------------*****
I have a DataFrame called user_answers, formed using get_dummies(). It looks like this:
Index,Q1_1,Q1_2,Q1_4,Q1_5,mas_Y,fhae_Y
1,1,0,0,0,0,0
2,0,0,1,0,1,0
3,0,1,0,0,1,1
4,1,0,0,0,1,0
5,0,0,0,1,1,0
6,0,0,1,0,1,1
7,0,1,0,0,1,1
I need to compare it against a similar DataFrame called DF_answers, which looks like this:
Index,Q1_1,Q1_2,Q1_3,Q1_4,Q1_5,mas_Y,fhae_Y
1,1,0,0,0,0,1,0
2,1,0,0,0,0,1,0
3,0,1,0,0,0,1,1
4,0,0,1,0,0,1,0
5,0,0,0,0,1,1,0
6,1,0,0,0,0,1,1
7,0,0,0,1,0,1,1
The problem I am having is that get_dummies does not create a Q1_3 column in the user_answers DataFrame when the user did not select Q1_3 as an answer to any of the 7 questions in the original questionnaire. I need the output of user_answers to look like the table below: even if the user never answered Q1_3, there should still be a Q1_3 column filled with zeros.
Index,Q1_1,Q1_2,Q1_3,Q1_4,Q1_5,mas_Y,fhae_Y
1,1,0,0,0,0,0,0
2,1,0,0,0,0,1,0
3,0,1,0,0,0,1,1
4,1,0,0,0,0,1,0
5,0,0,0,1,0,1,0
6,1,0,0,0,0,1,1
7,1,0,0,0,0,1,1
I have possibly been overthinking this. I read that you can pass a list of column names into get_dummies().
Sorry for the delay; find my attempt below.
From what I understand, the following applies:
You have a dataframe which only has the questions the user filled out.
You need to merge this onto a frame which has every question, for some sort of further analysis?
If this is true, this is my noobish attempt:
import numpy as np
import pandas as pd

cols = ['ID', 'Q1_1', 'Q1_2', 'Q1_4', 'Q1_5', 'mas_Y', 'fhae_Y']
data = []
for x in enumerate(cols):
    data.append(np.random.randint(0, 150, size=150))
df = pd.DataFrame(dict(zip(cols, data)))
print(df.head())
ID Q1_1 Q1_2 Q1_4 Q1_5 mas_Y fhae_Y
0 7 76 41 46 57 75 139
1 11 118 65 38 17 116 75
2 111 104 109 110 32 53 106
3 131 14 92 128 14 22 65
4 83 72 148 99 103 133 144
## Create a dummy frame with every possible question column
cols_b = ['ID']
x = 0
for i in range(1, 101):
    cols_b.append('Q1_' + str(x + i))
data_b = []
for x in enumerate(cols_b):
    data_b.append(np.nan)
df2 = pd.DataFrame(dict(zip(cols_b, data_b)), index=[0])

final_cols = list(df2.columns)
final_cols.append('fhae_Y')
final_cols.append('mas_Y')

df = pd.merge(df, df2, how='left')
print(df[final_cols].fillna(0).head(5))
ID Q1_1 Q1_2 Q1_3 Q1_4 Q1_5 Q1_6 Q1_7 Q1_8 Q1_9 ... Q1_93 Q1_94 Q1_95 Q1_96 Q1_97 Q1_98 Q1_99 Q1_100 fhae_Y mas_Y
0 7 76 41 0.0 46 57 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 139 75
1 11 118 65 0.0 38 17 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 75 116
2 111 104 109 0.0 110 32 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 106 53
3 131 14 92 0.0 128 14 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 65 22
4 83 72 148 0.0 99 103 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 144 133
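A shorter route, as a sketch, assuming the full set of expected columns is known up front (DataFrame.reindex adds any missing columns and fills them with a constant):
expected = ['Q1_1', 'Q1_2', 'Q1_3', 'Q1_4', 'Q1_5', 'mas_Y', 'fhae_Y']

# columns already present keep their values; missing ones (e.g. Q1_3)
# are added and filled with 0
user_answers = user_answers.reindex(columns=expected, fill_value=0)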

astype() does not change floats

Even though this seems really simple, it drives me nuts. Why is .astype(int) not changing the floats to ints? Thank you
df_new = pd.crosstab(df["date"], df["place"]).reset_index()
places = ['cityA', "cityB", "cityC"]
df_new[places] = df_new[places].fillna(0).astype(int)
sums = df_new.select_dtypes(pd.np.number).sum().rename('total')
df_new = df_new.append(sums)
print(df_new)
Output:
place date cityA cityB cityC
0 2008-01-01 0.0 0.0 51.0
1 2009-06-01 0.0 618.0 0.0
2 2015-07-01 549.0 0.0 0.0
3 2016-01-01 41.0 0.0 0.0
4 2016-04-01 62.0 0.0 0.0
5 2017-01-01 800.0 0.0 0.0
6 2018-07-01 69.0 0.0 0.0
total NaT 1521.0 618.0 51.0
If there are NAs (which are stored as floats in pandas), the other values will be upcast to float as well. Here, the appended 'total' row has no date value, and the missing entry introduced during alignment forces the numeric columns back to float.
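One workaround, as a sketch assuming pandas 0.24 or later: the nullable 'Int64' dtype (capital I) can hold missing values without falling back to float, so the cast can be applied after the append:
# convert the city columns after appending the totals row;
# 'Int64' tolerates missing values elsewhere in that row
df_new[places] = df_new[places].astype('Int64')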
