Reshaping hourly values stored in groups of 3 columns in Python - python

Imagine you have a raw dataframe like the following:
What I would like, in order to be able to work on the data, is to rearrange it so that every group of 3 columns (representing the hourly values of each day) becomes a new row with a datetime value for each hour (e.g. 2015-05-31 00:00:00, 2015-05-31 01:00:00, 2015-05-31 02:00:00, etc.), ending up with just 4 columns: Date, Tmin, Tmax, and Nsum.
Here is the raw dictionary from the imported CSV (just a few rows):
{'Date': {0: '2015-04-30', 1: '2015-05-01', 2: '2015-05-02', 3: '2015-05-03', 4: '2015-05-04'}, 'T min °C': {0: 11.7, 1: 8.3, 2: 8.3, 3: 11.6, 4: 12.4}, 'T max °C': {0: 11.9, 1: 8.9, 2: 8.4, 3: 11.8, 4: 12.7}, 'N sum mm': {0: 0.0, 1: 0.0, 2: 0.6, 3: 1.9, 4: 0.0}, 'T min °C.1': {0: 11.6, 1: 8.0, 2: 8.3, 3: 11.4, 4: 12.4}, 'T max °C.1': {0: 11.8, 1: 8.2, 2: 8.3, 3: 11.6, 4: 12.4}, 'N sum mm.1': {0: 0.0, 1: 0.1, 2: 0.6, 3: 0.3, 4: 0.0}, 'T min °C.2': {0: 10.2, 1: 7.9, 2: 8.2, 3: 11.1, 4: 12.2}, 'T max °C.2': {0: 11.2, 1: 8.1, 2: 8.3, 3: 11.4, 4: 12.3}, 'N sum mm.2': {0: 0.0, 1: 0.0, 2: 1.5, 3: 0.2, 4: 0.0}, 'T min °C.3': {0: 9.2, 1: 7.5, 2: 8.1, 3: 11.0, 4: 12.1}, 'T max °C.3': {0: 9.8, 1: 7.8, 2: 8.2, 3: 11.1, 4: 12.2}, 'N sum mm.3': {0: 0.0, 1: 0.0, 2: 0.4, 3: 0.0, 4: 0.0}, 'T min °C.4': {0: 8.8, 1: 7.0, 2: 8.2, 3: 10.8, 4: 12.0}, 'T max °C.4': {0: 9.2, 1: 7.5, 2: 8.2, 3: 10.9, 4: 12.1}, 'N sum mm.4': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.1, 4: 0.0}, 'T min °C.5': {0: 8.4, 1: 7.0, 2: 8.2, 3: 10.6, 4: 11.9}, 'T max °C.5': {0: 8.6, 1: 7.1, 2: 8.3, 3: 10.8, 4: 12.1}, 'N sum mm.5': {0: 0.1, 1: 0.0, 2: 0.0, 3: 0.2, 4: 0.0}, 'T min °C.6': {0: 8.6, 1: 6.9, 2: 8.1, 3: 10.5, 4: 11.8}, 'T max °C.6': {0: 8.7, 1: 7.0, 2: 8.3, 3: 10.6, 4: 11.9}, 'N sum mm.6': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.1, 4: 0.0}, 'T min °C.7': {0: 8.5, 1: 6.8, 2: 8.4, 3: 10.4, 4: 11.8}, 'T max °C.7': {0: 8.7, 1: 7.0, 2: 8.9, 3: 10.5, 4: 12.0}, 'N sum mm.7': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.2, 4: 0.2}, 'T min °C.8': {0: 8.4, 1: 7.0, 2: 9.1, 3: 10.6, 4: 12.0}, 'T max °C.8': {0: 8.4, 1: 7.2, 2: 10.8, 3: 10.8, 4: 12.8}, 'N sum mm.8': {0: 1.4, 1: 0.0, 2: 0.0, 3: 0.1, 4: 0.0}, 'T min °C.9': {0: 7.0, 1: 7.3, 2: 11.2, 3: 10.9, 4: 13.0}, 'T max °C.9': {0: 8.3, 1: 7.8, 2: 12.5, 3: 11.4, 4: 13.5}, 'N sum mm.9': {0: 2.9, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}, 'T min °C.10': {0: 6.7, 1: 8.0, 2: 12.3, 3: 11.5, 4: 13.6}, 'T max °C.10': {0: 6.9, 1: 8.2, 2: 13.9, 3: 12.3, 4: 14.8}, 'N sum mm.10': {0: 2.9, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}, 'T 
min °C.11': {0: 6.5, 1: 8.2, 2: 14.5, 3: 12.3, 4: 15.0}, 'T max °C.11': {0: 6.6, 1: 8.5, 2: 16.1, 3: 12.7, 4: 15.8}, 'N sum mm.11': {0: 3.7, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}, 'T min °C.12': {0: 6.7, 1: 8.3, 2: 16.3, 3: 12.8, 4: 15.9}, 'T max °C.12': {0: 7.3, 1: 8.4, 2: 17.6, 3: 13.4, 4: 16.3}, 'N sum mm.12': {0: 1.1, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}, 'T min °C.13': {0: 7.6, 1: 8.4, 2: 17.8, 3: 13.6, 4: 16.3}, 'T max °C.13': {0: 8.8, 1: 8.5, 2: 18.6, 3: 13.9, 4: 17.0}, 'N sum mm.13': {0: 0.0, 1: 0.1, 2: 0.0, 3: 0.0, 4: 0.0}, 'T min °C.14': {0: 9.5, 1: 8.6, 2: 19.2, 3: 14.1, 4: 17.0}, 'T max °C.14': {0: 11.4, 1: 9.1, 2: 19.8, 3: 14.3, 4: 17.3}, 'N sum mm.14': {0: 0.0, 1: 0.3, 2: 0.0, 3: 0.0, 4: 0.0}, 'T min °C.15': {0: 11.4, 1: 9.0, 2: 20.0, 3: 14.4, 4: 16.7}, 'T max °C.15': {0: 12.6, 1: 9.1, 2: 20.5, 3: 15.0, 4: 17.0}, 'N sum mm.15': {0: 0.0, 1: 0.4, 2: 0.0, 3: 0.0, 4: 0.0}, 'T min °C.16': {0: 12.6, 1: 9.1, 2: 20.0, 3: 14.8, 4: 16.8}, 'T max °C.16': {0: 13.4, 1: 9.3, 2: 20.4, 3: 14.9, 4: 17.1}, 'N sum mm.16': {0: 0.0, 1: 0.2, 2: 0.0, 3: 0.0, 4: 0.0}, 'T min °C.17': {0: 13.7, 1: 9.2, 2: 19.6, 3: 14.6, 4: 16.3}, 'T max °C.17': {0: 14.1, 1: 9.3, 2: 20.0, 3: 14.7, 4: 16.5}, 'N sum mm.17': {0: 0.0, 1: 0.1, 2: 0.0, 3: 0.0, 4: 0.0}, 'T min °C.18': {0: 12.9, 1: 8.9, 2: 17.7, 3: 14.2, 4: 16.0}, 'T max °C.18': {0: 13.9, 1: 9.1, 2: 19.4, 3: 14.6, 4: 16.4}, 'N sum mm.18': {0: 0.0, 1: 0.1, 2: 0.0, 3: 0.0, 4: 0.0}, 'T min °C.19': {0: 11.0, 1: 8.7, 2: 16.0, 3: 14.0, 4: 15.3}, 'T max °C.19': {0: 12.2, 1: 8.9, 2: 17.9, 3: 14.1, 4: 16.1}, 'N sum mm.19': {0: 0.0, 1: 0.2, 2: 0.0, 3: 0.0, 4: 0.0}, 'T min °C.20': {0: 9.9, 1: 8.6, 2: 14.6, 3: 13.2, 4: 14.5}, 'T max °C.20': {0: 10.9, 1: 8.7, 2: 16.0, 3: 14.0, 4: 15.4}, 'N sum mm.20': {0: 0.0, 1: 0.7, 2: 0.0, 3: 0.0, 4: 0.0}, 'T min °C.21': {0: 10.2, 1: 8.6, 2: 13.8, 3: 12.8, 4: 14.2}, 'T max °C.21': {0: 10.5, 1: 8.6, 2: 14.9, 3: 13.4, 4: 14.9}, 'N sum mm.21': {0: 0.0, 1: 1.5, 2: 0.2, 3: 0.0, 4: 0.0}, 'T min °C.22': {0: 9.1, 1: 8.5, 2: 12.1, 
3: 12.8, 4: 13.8}, 'T max °C.22': {0: 10.2, 1: 8.5, 2: 13.2, 3: 12.9, 4: 14.3}, 'N sum mm.22': {0: 0.0, 1: 1.3, 2: 0.7, 3: 0.0, 4: 0.0}, 'T min °C.23': {0: 9.1, 1: 8.4, 2: 11.9, 3: 12.7, 4: 13.4}, 'T max °C.23': {0: 9.6, 1: 8.4, 2: 12.7, 3: 12.8, 4: 14.1}, 'N sum mm.23': {0: 0.0, 1: 1.3, 2: 2.1, 3: 0.0, 4: 0.0}}

First create a DatetimeIndex: set 'Date' as the index, reshape the values into groups of 3 columns, and build the new index with numpy.repeat:
import numpy as np
import pandas as pd

df = df.set_index('Date')
df = pd.DataFrame(df.values.reshape(-1, 3),
                  index=pd.to_datetime(np.repeat(df.index, len(df.columns) // 3)),
                  columns=['Tmin', 'Tmax', 'Nsum'])
Last, add the hours by converting the row position modulo 24 to timedeltas:
df.index += pd.to_timedelta(np.arange(len(df)) % 24, unit='h')
df = df.rename_axis('Date').reset_index()
print(df.head(30))
Date Tmin Tmax Nsum
0 2015-04-30 00:00:00 11.7 11.9 0.0
1 2015-04-30 01:00:00 11.6 11.8 0.0
2 2015-04-30 02:00:00 10.2 11.2 0.0
3 2015-04-30 03:00:00 9.2 9.8 0.0
4 2015-04-30 04:00:00 8.8 9.2 0.0
5 2015-04-30 05:00:00 8.4 8.6 0.1
6 2015-04-30 06:00:00 8.6 8.7 0.0
7 2015-04-30 07:00:00 8.5 8.7 0.0
8 2015-04-30 08:00:00 8.4 8.4 1.4
9 2015-04-30 09:00:00 7.0 8.3 2.9
10 2015-04-30 10:00:00 6.7 6.9 2.9
11 2015-04-30 11:00:00 6.5 6.6 3.7
12 2015-04-30 12:00:00 6.7 7.3 1.1
13 2015-04-30 13:00:00 7.6 8.8 0.0
14 2015-04-30 14:00:00 9.5 11.4 0.0
15 2015-04-30 15:00:00 11.4 12.6 0.0
16 2015-04-30 16:00:00 12.6 13.4 0.0
17 2015-04-30 17:00:00 13.7 14.1 0.0
18 2015-04-30 18:00:00 12.9 13.9 0.0
19 2015-04-30 19:00:00 11.0 12.2 0.0
20 2015-04-30 20:00:00 9.9 10.9 0.0
21 2015-04-30 21:00:00 10.2 10.5 0.0
22 2015-04-30 22:00:00 9.1 10.2 0.0
23 2015-04-30 23:00:00 9.1 9.6 0.0
24 2015-05-01 00:00:00 8.3 8.9 0.0
25 2015-05-01 01:00:00 8.0 8.2 0.1
26 2015-05-01 02:00:00 7.9 8.1 0.0
27 2015-05-01 03:00:00 7.5 7.8 0.0
28 2015-05-01 04:00:00 7.0 7.5 0.0
29 2015-05-01 05:00:00 7.0 7.1 0.0
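The full pipeline can be sketched end-to-end on a hypothetical miniature of the CSV (2 days with 3 hourly triplets each, so the modulo uses the triplet count instead of 24; column names and values below are made up to mirror the real data):

```python
import numpy as np
import pandas as pd

# Made-up miniature of the wide CSV: 2 days x 3 hours, each hour
# contributing a (Tmin, Tmax, Nsum) triplet of columns.
raw = {
    'Date': ['2015-04-30', '2015-05-01'],
    'T min C':   [11.7, 8.3], 'T max C':   [11.9, 8.9], 'N sum mm':   [0.0, 0.0],
    'T min C.1': [11.6, 8.0], 'T max C.1': [11.8, 8.2], 'N sum mm.1': [0.0, 0.1],
    'T min C.2': [10.2, 7.9], 'T max C.2': [11.2, 8.1], 'N sum mm.2': [0.0, 0.0],
}
df = pd.DataFrame(raw).set_index('Date')

n_hours = len(df.columns) // 3          # triplets per day (24 in the real data)
df = pd.DataFrame(df.values.reshape(-1, 3),
                  index=pd.to_datetime(np.repeat(df.index, n_hours)),
                  columns=['Tmin', 'Tmax', 'Nsum'])
df.index += pd.to_timedelta(np.arange(len(df)) % n_hours, unit='h')
df = df.rename_axis('Date').reset_index()
print(df)
```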

Related

Convert `DataFrame.groupby()` into dictionary (and then reverse it)

Say I have the following DataFrame() where I have repeated observations per individual (column id_ind). Hence, the first two rows belong to the first individual, the third and fourth rows belong to the second individual, and so forth...
import pandas as pd
X = pd.DataFrame.from_dict({'x1_1': {0: -0.1766214634108258, 1: 1.645852185286492, 2: -0.13348860101031038, 3: 1.9681043689968933, 4: -1.7004428240831382, 5: 1.4580091413853749, 6: 0.06504113741068565, 7: -1.2168493676768384, 8: -0.3071304478616376, 9: 0.07121332925591593}, 'x1_2': {0: -2.4207773498298844, 1: -1.0828751040719462, 2: 2.73533787008624, 3: 1.5979611987152071, 4: 0.08835542172064115, 5: 1.2209786277076156, 6: -0.44205979195950784, 7: -0.692872860268244, 8: 0.0375521181289943, 9: 0.4656030062266639}, 'x1_3': {0: -1.548320898226322, 1: 0.8457342014424675, 2: -0.21250514722879738, 3: 0.5292389938329516, 4: -2.593946520223666, 5: -0.6188958526077123, 6: 1.6949245117526974, 7: -1.0271341091035742, 8: 0.637561891142571, 9: -0.7717170035055559}, 'x2_1': {0: 0.3797245517345564, 1: -2.2364391598508835, 2: 0.6205947900678905, 3: 0.6623865847688559, 4: 1.562036259999875, 5: -0.13081282910947759, 6: 0.03914373833251773, 7: -0.995761652421108, 8: 1.0649494418154162, 9: 1.3744782478849122}, 'x2_2': {0: -0.5052556836786106, 1: 1.1464291788297152, 2: -0.5662380273138174, 3: 0.6875729143723538, 4: 0.04653136473130827, 5: -0.012885303852347407, 6: 1.5893672346098884, 7: 0.5464286050059511, 8: -0.10430829457707284, 9: -0.5441755265313813}, 'x2_3': {0: -0.9762973303149007, 1: -0.983731467806563, 2: 1.465827578266328, 3: 0.5325950414202745, 4: -1.4452121324204903, 5: 0.8148816373643869, 6: 0.470791989780882, 7: -0.17951636294180473, 8: 0.7351814781280054, 9: -0.28776723200679066}, 'x3_1': {0: 0.12751822396637064, 1: -0.21926633684030983, 2: 0.15758799357206943, 3: 0.5885412224632464, 4: 0.11916562911189271, 5: -1.6436210334529249, 6: -0.12444368631987467, 7: 1.4618564171802453, 8: 0.6847234328916137, 9: -0.23177118858569187}, 'x3_2': {0: -0.6452955690715819, 1: 1.052094761527654, 2: 0.20190339195326157, 3: 0.6839430295237913, 4: -0.2607691613858866, 5: 0.3315513026670213, 6: 0.015901139336566113, 7: 0.15243420084881903, 8: -0.7604225072161022, 9: -0.4387652927008854}, 
'x3_3': {0: -1.067058994377549, 1: 0.8026914180717286, 2: -1.9868531745912268, 3: -0.5057770735303253, 4: -1.6589569342151713, 5: 0.358172252880764, 6: 1.9238983803281329, 7: 2.2518318810978246, 8: -1.2781475121874357, 9: -0.7103081175166167}})
Y = pd.DataFrame.from_dict({'CHOICE': {0: 1.0, 1: 1.0, 2: 2.0, 3: 2.0, 4: 3.0, 5: 2.0, 6: 1.0, 7: 1.0, 8: 2.0, 9: 2.0}})
Z = pd.DataFrame.from_dict({'z1': {0: 2.4196730570917233, 1: 2.4196730570917233, 2: 2.822802255159467, 3: 2.822802255159467, 4: 2.073171091633643, 5: 2.073171091633643, 6: 2.044165101485163, 7: 2.044165101485163, 8: 2.4001241292606275, 9: 2.4001241292606275}, 'z2': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 1.0, 5: 1.0, 6: 1.0, 7: 1.0, 8: 0.0, 9: 0.0}, 'z3': {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0, 4: 2.0, 5: 2.0, 6: 2.0, 7: 2.0, 8: 3.0, 9: 3.0}})
id = pd.DataFrame.from_dict({'id_choice': {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0, 4: 5.0, 5: 6.0, 6: 7.0, 7: 8.0, 8: 9.0, 9: 10.0}, 'id_ind': {0: 1.0, 1: 1.0, 2: 2.0, 3: 2.0, 4: 3.0, 5: 3.0, 6: 4.0, 7: 4.0, 8: 5.0, 9: 5.0}} )
# Create a dataframe with all the data
data = pd.concat([id, X, Z, Y], axis=1)
print(data.head(4))
# id_choice id_ind x1_1 x1_2 x1_3 x2_1 x2_2 \
# 0 1.0 1.0 -0.176621 -2.420777 -1.548321 0.379725 -0.505256
# 1 2.0 1.0 1.645852 -1.082875 0.845734 -2.236439 1.146429
# 2 3.0 2.0 -0.133489 2.735338 -0.212505 0.620595 -0.566238
# 3 4.0 2.0 1.968104 1.597961 0.529239 0.662387 0.687573
#
# x2_3 x3_1 x3_2 x3_3 z1 z2 z3 CHOICE
# 0 -0.976297 0.127518 -0.645296 -1.067059 2.419673 0.0 1.0 1.0
# 1 -0.983731 -0.219266 1.052095 0.802691 2.419673 0.0 1.0 1.0
# 2 1.465828 0.157588 0.201903 -1.986853 2.822802 0.0 1.0 2.0
# 3 0.532595 0.588541 0.683943 -0.505777 2.822802 0.0 1.0 2.0
I want to perform two operations.
First, I want to convert the DataFrame data into a dictionary of DataFrame()s where the keys are the individual identifiers (in this particular case, numbers ranging from 1.0 to 5.0). I've done this below as suggested here. Unfortunately, I am getting a dictionary of nested lists of values and not a dictionary of DataFrame()s.
# Create a dictionary with the data for each individual
data_dict = data.set_index('id_ind').groupby('id_ind').apply(lambda x : x.to_numpy().tolist()).to_dict()
print(data_dict.keys())
# dict_keys([1.0, 2.0, 3.0, 4.0, 5.0])
print(data_dict[1.0])
#[[1.0, -0.1766214634108258, -2.4207773498298844, -1.548320898226322, 0.3797245517345564, -0.5052556836786106, -0.9762973303149007, 0.12751822396637064, -0.6452955690715819, -1.067058994377549, 2.4196730570917233, 0.0, 1.0, 1.0], [2.0, 1.645852185286492, -1.0828751040719462, 0.8457342014424675, -2.2364391598508835, 1.1464291788297152, -0.983731467806563, -0.21926633684030983, 1.052094761527654, 0.8026914180717286, 2.4196730570917233, 0.0, 1.0, 1.0]]
Second, I want to recover the original DataFrame data by reversing the previous operation. The naive approach is as follows. However, it is, of course, not producing the expected result.
# Naive approach
res = pd.DataFrame.from_dict(data_dict, orient='index')
print(res)
# 0 1
#1.0 [1.0, -0.1766214634108258, -2.4207773498298844... [2.0, 1.645852185286492, -1.0828751040719462, ...
#2.0 [3.0, -0.13348860101031038, 2.73533787008624, ... [4.0, 1.9681043689968933, 1.5979611987152071, ...
#3.0 [5.0, -1.7004428240831382, 0.08835542172064115... [6.0, 1.4580091413853749, 1.2209786277076156, ...
#4.0 [7.0, 0.06504113741068565, -0.4420597919595078... [8.0, -1.2168493676768384, -0.692872860268244,...
#5.0 [9.0, -0.3071304478616376, 0.0375521181289943,... [10.0, 0.07121332925591593, 0.4656030062266639...
This solution was inspired by @mozway's comments.
# Create a dictionary with the data for each individual
data_dict = dict(list(data.groupby('id_ind')))
# Convert the dictionary into a dataframe
res = pd.concat(data_dict, axis=0).reset_index(drop=True)
print(res.head(4))
# id_choice id_ind x1_1 x1_2 x1_3 x2_1 x2_2 \
#0 1.0 1.0 -0.176621 -2.420777 -1.548321 0.379725 -0.505256
#1 2.0 1.0 1.645852 -1.082875 0.845734 -2.236439 1.146429
#2 3.0 2.0 -0.133489 2.735338 -0.212505 0.620595 -0.566238
#3 4.0 2.0 1.968104 1.597961 0.529239 0.662387 0.687573
#
# x2_3 x3_1 x3_2 x3_3 z1 z2 z3 CHOICE
#0 -0.976297 0.127518 -0.645296 -1.067059 2.419673 0.0 1.0 1.0
#1 -0.983731 -0.219266 1.052095 0.802691 2.419673 0.0 1.0 1.0
#2 1.465828 0.157588 0.201903 -1.986853 2.822802 0.0 1.0 2.0
#3 0.532595 0.588541 0.683943 -0.505777 2.822802 0.0 1.0 2.0
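The round trip can be checked on a small hypothetical frame with the same id_ind layout (column names and values below are made up):

```python
import pandas as pd

# Two individuals, two observations each (made-up values).
data = pd.DataFrame({'id_ind': [1.0, 1.0, 2.0, 2.0],
                     'x1_1': [0.1, 0.2, 0.3, 0.4]})

# groupby -> dict of DataFrames (not nested lists)
data_dict = dict(list(data.groupby('id_ind')))
print(type(data_dict[1.0]))   # each value is a DataFrame

# ...and back to the original frame
res = pd.concat(data_dict, axis=0).reset_index(drop=True)
print(res.equals(data))
```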

Flattening out a MultiIndex dataframe

I have a df:
df = pd.DataFrame.from_dict({('group', ''): {0: 'A',
1: 'A',
2: 'A',
3: 'A',
4: 'A',
5: 'A',
6: 'A',
7: 'A',
8: 'A',
9: 'B',
10: 'B',
11: 'B',
12: 'B',
13: 'B',
14: 'B',
15: 'B',
16: 'B',
17: 'B',
18: 'all',
19: 'all'},
('category', ''): {0: 'Amazon',
1: 'Apple',
2: 'Facebook',
3: 'Google',
4: 'Netflix',
5: 'Tesla',
6: 'Total',
7: 'Uber',
8: 'total',
9: 'Amazon',
10: 'Apple',
11: 'Facebook',
12: 'Google',
13: 'Netflix',
14: 'Tesla',
15: 'Total',
16: 'Uber',
17: 'total',
18: 'Total',
19: 'total'},
(pd.Timestamp('2020-06-29'), 'last_sales'): {0: 195.0,
1: 61.0,
2: 106.0,
3: 61.0,
4: 37.0,
5: 13.0,
6: 954.0,
7: 4.0,
8: 477.0,
9: 50.0,
10: 50.0,
11: 75.0,
12: 43.0,
13: 17.0,
14: 14.0,
15: 504.0,
16: 3.0,
17: 252.0,
18: 2916.0,
19: 2916.0},
(pd.Timestamp('2020-06-29'), 'sales'): {0: 1268.85,
1: 18274.385000000002,
2: 19722.65,
3: 55547.255,
4: 15323.800000000001,
5: 1688.6749999999997,
6: 227463.23,
7: 1906.0,
8: 113731.615,
9: 3219.6499999999996,
10: 15852.060000000001,
11: 17743.7,
12: 37795.15,
13: 5918.5,
14: 1708.75,
15: 166349.64,
16: 937.01,
17: 83174.82,
18: 787625.7400000001,
19: 787625.7400000001},
(pd.Timestamp('2020-06-29'), 'difference'): {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 0.0,
8: 0.0,
9: 0.0,
10: 0.0,
11: 0.0,
12: 0.0,
13: 0.0,
14: 0.0,
15: 0.0,
16: 0.0,
17: 0.0,
18: 0.0,
19: 0.0},
(pd.Timestamp('2020-07-06'), 'last_sales'): {0: 26.0,
1: 39.0,
2: 79.0,
3: 49.0,
4: 10.0,
5: 10.0,
6: 436.0,
7: 5.0,
8: 218.0,
9: 89.0,
10: 34.0,
11: 133.0,
12: 66.0,
13: 21.0,
14: 20.0,
15: 732.0,
16: 3.0,
17: 366.0,
18: 2336.0,
19: 2336.0},
(pd.Timestamp('2020-07-06'), 'sales'): {0: 3978.15,
1: 12138.96,
2: 19084.175,
3: 40033.46000000001,
4: 4280.15,
5: 1495.1,
6: 165548.29,
7: 1764.15,
8: 82774.145,
9: 8314.92,
10: 12776.649999999996,
11: 28048.075,
12: 55104.21000000002,
13: 6962.844999999999,
14: 3053.2000000000003,
15: 231049.11000000002,
16: 1264.655,
17: 115524.55500000001,
18: 793194.8000000002,
19: 793194.8000000002},
(pd.Timestamp('2020-07-06'), 'difference'): {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 0.0,
8: 0.0,
9: 0.0,
10: 0.0,
11: 0.0,
12: 0.0,
13: 0.0,
14: 0.0,
15: 0.0,
16: 0.0,
17: 0.0,
18: 0.0,
19: 0.0},
(pd.Timestamp('2021-06-28'), 'last_sales'): {0: 96.0,
1: 56.0,
2: 106.0,
3: 44.0,
4: 34.0,
5: 13.0,
6: 716.0,
7: 9.0,
8: 358.0,
9: 101.0,
10: 22.0,
11: 120.0,
12: 40.0,
13: 13.0,
14: 8.0,
15: 610.0,
16: 1.0,
17: 305.0,
18: 2652.0,
19: 2652.0},
(pd.Timestamp('2021-06-28'), 'sales'): {0: 5194.95,
1: 19102.219999999994,
2: 22796.420000000002,
3: 30853.115,
4: 11461.25,
5: 992.6,
6: 188143.41,
7: 3671.15,
8: 94071.705,
9: 6022.299999999998,
10: 7373.6,
11: 33514.0,
12: 35943.45,
13: 4749.000000000001,
14: 902.01,
15: 177707.32,
16: 349.3,
17: 88853.66,
18: 731701.46,
19: 731701.46},
(pd.Timestamp('2021-06-28'), 'difference'): {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 0.0,
8: 0.0,
9: 0.0,
10: 0.0,
11: 0.0,
12: 0.0,
13: 0.0,
14: 0.0,
15: 0.0,
16: 0.0,
17: 0.0,
18: 0.0,
19: 0.0},
(pd.Timestamp('2021-07-07'), 'last_sales'): {0: 45.0,
1: 47.0,
2: 87.0,
3: 45.0,
4: 13.0,
5: 8.0,
6: 494.0,
7: 2.0,
8: 247.0,
9: 81.0,
10: 36.0,
11: 143.0,
12: 56.0,
13: 9.0,
14: 9.0,
15: 670.0,
16: 1.0,
17: 335.0,
18: 2328.0,
19: 2328.0},
(pd.Timestamp('2021-07-07'), 'sales'): {0: 7556.414999999998,
1: 14985.05,
2: 16790.899999999998,
3: 36202.729999999996,
4: 4024.97,
5: 1034.45,
6: 163960.32999999996,
7: 1385.65,
8: 81980.16499999998,
9: 5600.544999999999,
10: 11209.92,
11: 32832.61,
12: 42137.44500000001,
13: 3885.1499999999996,
14: 1191.5,
15: 194912.34000000003,
16: 599.0,
17: 97456.17000000001,
18: 717745.3400000001,
19: 717745.3400000001},
(pd.Timestamp('2021-07-07'), 'difference'): {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 0.0,
8: 0.0,
9: 0.0,
10: 0.0,
11: 0.0,
12: 0.0,
13: 0.0,
14: 0.0,
15: 0.0,
16: 0.0,
17: 0.0,
18: 0.0,
19: 0.0}}).set_index(['group','category'])
I am trying to flatten it so it no longer has a MultiIndex. As there are several dates, I try to select one:
df.loc[:,'2020-06-29 00:00:00']
But this gives me an error:
KeyError: '2020-06-29 00:00:00'
I want the first week (and my final output), 2020-06-29, to look like this:
group category last_sales sales difference
A Amazon 195.00 1,268.85 0.00
A Apple 61.00 18,274.39 0.00
A Facebook 106.00 19,722.65 0.00
A Google 61.00 55,547.25 0.00
A Netflix 37.00 15,323.80 0.00
A Tesla 13.00 1,688.67 0.00
A Total 954.00 227,463.23 0.00
A Uber 4.00 1,906.00 0.00
A total 477.00 113,731.62 0.00
B Amazon 0.00 3,219.65 0.00
B Apple 50.00 15,852.06 0.00
B Facebook 75.00 17,743.70 0.00
B Google 43.00 37,795.15 0.00
B Netflix 17.00 5,918.50 0.00
B Tesla 14.00 1,708.75 0.00
B Total 504.00 166,349.64 0.00
B Uber 3.00 937.01 0.00
B total 252.00 83,174.82 0.00
all Total 2,916.00 787,625.74 0.00
Try via pd.to_datetime():
out = df.loc[:, pd.to_datetime('2020-06-29 00:00:00')]
#out = df.loc[:, pd.to_datetime('2020-06-29 00:00:00')].reset_index()
OR try via pd.Timestamp():
out = df.loc[:, pd.Timestamp('2020-06-29 00:00:00')]
#out = df.loc[:, pd.Timestamp('2020-06-29 00:00:00')].reset_index()
The 0th level of your columns holds Timestamp objects, and you can verify that with:
df.columns.to_numpy()
#output
array([(Timestamp('2020-06-29 00:00:00'), 'last_sales'),
(Timestamp('2020-06-29 00:00:00'), 'sales'),
(Timestamp('2020-06-29 00:00:00'), 'difference'),
(Timestamp('2020-07-06 00:00:00'), 'last_sales'),
(Timestamp('2020-07-06 00:00:00'), 'sales'),
(Timestamp('2020-07-06 00:00:00'), 'difference'),
(Timestamp('2021-06-28 00:00:00'), 'last_sales'),
(Timestamp('2021-06-28 00:00:00'), 'sales'),
(Timestamp('2021-06-28 00:00:00'), 'difference'),
(Timestamp('2021-07-07 00:00:00'), 'last_sales'),
(Timestamp('2021-07-07 00:00:00'), 'sales'),
(Timestamp('2021-07-07 00:00:00'), 'difference')], dtype=object)
output of out:
last_sales sales difference
group category
A Amazon 195.0 1268.850 0.0
Apple 61.0 18274.385 0.0
Facebook 106.0 19722.650 0.0
Google 61.0 55547.255 0.0
Netflix 37.0 15323.800 0.0
Tesla 13.0 1688.675 0.0
Total 954.0 227463.230 0.0
Uber 4.0 1906.000 0.0
total 477.0 113731.615 0.0
B Amazon 50.0 3219.650 0.0
Apple 50.0 15852.060 0.0
Facebook 75.0 17743.700 0.0
Google 43.0 37795.150 0.0
Netflix 17.0 5918.500 0.0
Tesla 14.0 1708.750 0.0
Total 504.0 166349.640 0.0
Uber 3.0 937.010 0.0
total 252.0 83174.820 0.0
all Total 2916.0 787625.740 0.0
total 2916.0 787625.740 0.0
NOTE:
There is no need to provide a tuple in .loc[] because you are selecting from the 0th level.
I’m also getting a KeyError, but if you use a Timestamp object to index the first-level columns, it seems to work:
>>> df[pd.Timestamp('2020-06-29 00:00:00')]
last_sales sales difference
group category
A Amazon 195.0 1268.850 0.0
Apple 61.0 18274.385 0.0
Facebook 106.0 19722.650 0.0
Google 61.0 55547.255 0.0
Netflix 37.0 15323.800 0.0
Tesla 13.0 1688.675 0.0
Total 954.0 227463.230 0.0
Uber 4.0 1906.000 0.0
total 477.0 113731.615 0.0
B Amazon 50.0 3219.650 0.0
Apple 50.0 15852.060 0.0
Facebook 75.0 17743.700 0.0
Google 43.0 37795.150 0.0
Netflix 17.0 5918.500 0.0
Tesla 14.0 1708.750 0.0
Total 504.0 166349.640 0.0
Uber 3.0 937.010 0.0
total 252.0 83174.820 0.0
all Total 2916.0 787625.740 0.0
total 2916.0 787625.740 0.0
Otherwise you could use .xs which will then also allow you more flexibility, e.g. selecting in the second level of columns and so on:
>>> df.xs(pd.Timestamp('2020-06-29 00:00:00'), axis='columns', level=0)
last_sales sales difference
group category
A Amazon 195.0 1268.850 0.0
Apple 61.0 18274.385 0.0
Facebook 106.0 19722.650 0.0
Google 61.0 55547.255 0.0
Netflix 37.0 15323.800 0.0
Tesla 13.0 1688.675 0.0
Total 954.0 227463.230 0.0
Uber 4.0 1906.000 0.0
total 477.0 113731.615 0.0
B Amazon 50.0 3219.650 0.0
Apple 50.0 15852.060 0.0
Facebook 75.0 17743.700 0.0
Google 43.0 37795.150 0.0
Netflix 17.0 5918.500 0.0
Tesla 14.0 1708.750 0.0
Total 504.0 166349.640 0.0
Uber 3.0 937.010 0.0
total 252.0 83174.820 0.0
all Total 2916.0 787625.740 0.0
total 2916.0 787625.740 0.0
You can then add .drop(index=[('all', 'total')]) to remove the second total line, and possibly .reset_index().
The way to do it with .loc[] is to provide a tuple, with the first item being a Timestamp object and the second an empty slice. However this will keep the 2 levels of indexing, so it is not what you want:
>>> df.loc[:, (pd.Timestamp('2020-06-29 00:00:00'), slice(None))].head(2)
2020-06-29 00:00:00
last_sales sales difference
group category
A Amazon 195.0 1268.850 0.0
Apple 61.0 18274.385 0.0
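The .xs selection can be sketched on a hypothetical one-row frame with the same (Timestamp, metric) column layout:

```python
import pandas as pd

# Columns are (Timestamp, metric) pairs, as in the question (values made up).
cols = pd.MultiIndex.from_product(
    [pd.to_datetime(['2020-06-29', '2020-07-06']), ['last_sales', 'sales']])
idx = pd.MultiIndex.from_tuples([('A', 'Amazon')], names=['group', 'category'])
df = pd.DataFrame([[195.0, 1268.85, 26.0, 3978.15]], index=idx, columns=cols)

# Select one week's block of columns and flatten the row index.
week = df.xs(pd.Timestamp('2020-06-29'), axis='columns', level=0).reset_index()
print(week)
```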

Conditionally summing multiple columns

I would like to sum values across several columns based on a condition in corresponding columns.
So I have columns for points:
{'secondBoxer1': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0},
'secondBoxer2': {0: 0.0, 1: 0.0, 2: 10.0, 3: 0.0, 4: 0.0},
'secondBoxer3': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0},
'secondBoxer4': {0: 15.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0},
'secondBoxer5': {0: 15.0, 1: 53.57142857142857, 2: 0.0, 3: 0.0, 4: 0.0},
'secondBoxer6': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0},
'secondBoxer7': {0: 0.0, 1: 0.0, 2: 0.0, 3: 50.0, 4: 0.0},
'secondBoxer8': {0: 0.0, 1: 0.0, 2: 0.0, 3: 37.142857142857146, 4: 0.0}}
and columns with the outcome of each fight:
{'outcome1': {0: 'win ', 1: 'win ', 2: 'win ', 3: 'draw ', 4: 'win '},
'outcome2': {0: 'win ', 1: 'win ', 2: 'win ', 3: 'win ', 4: 'win '},
'outcome3': {0: 'win ', 1: 'win ', 2: 'win ', 3: 'win ', 4: 'scheduled '},
'outcome4': {0: 'win ', 1: 'win ', 2: 'nan', 3: 'loss ', 4: 'nan'},
'outcome5': {0: 'win ', 1: 'draw ', 2: 'nan', 3: 'win ', 4: 'nan'},
'outcome6': {0: 'nan', 1: 'nan', 2: 'nan', 3: 'loss ', 4: 'nan'},
'outcome7': {0: 'nan', 1: 'nan', 2: 'nan', 3: 'loss ', 4: 'nan'},
'outcome8': {0: 'nan', 1: 'nan', 2: 'nan', 3: 'win ', 4: 'nan'}}
I would like to sum the points in the points columns only in cases where the corresponding outcome equals a win.
I have written this code, where opp_names is the list of columns with the points and outcome_cols is the list of columns with the outcomes:
data[opp_names].sum(axis=1).where(data[outcome_cols] == 'win')
The problem is that this returns the total sum of points without actually applying the condition.
In your case, use mask (d is your first dict, d1 is your second dict):
pd.DataFrame(d).mask(pd.DataFrame(d1).ne('win ').to_numpy()).sum(1)
Out[164]:
0 30.000000
1 0.000000
2 10.000000
3 37.142857
4 0.000000
dtype: float64
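The mask idea can be sketched on a tiny made-up pair of frames (note the trailing space in 'win ', which matters in the original data too):

```python
import pandas as pd

points = pd.DataFrame({'p1': [10.0, 0.0], 'p2': [5.0, 7.0]})
outcomes = pd.DataFrame({'o1': ['win ', 'loss '], 'o2': ['draw ', 'win ']})

# NaN-out every points cell whose matching outcome is not a win, then sum rows.
win_points = points.mask(outcomes.ne('win ').to_numpy()).sum(axis=1)
print(win_points.tolist())  # [10.0, 7.0]
```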

Number of labels does not match samples on decision tree regression

Trying to run a decision tree regressor on my data, but whenever I try to run my code, I get this error:
ValueError: Number of labels=78177 does not match number of samples=312706
#feature selection
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

target = ['sale_price']
train, test = train_test_split(housing_data, test_size=0.2)
regression_tree = DecisionTreeRegressor(criterion="entropy", random_state=100,
                                        max_depth=4, min_samples_leaf=5)
regression_tree.fit(train, test)
I have added a sample of my data; hopefully this gives you more context to better understand my question and problem:
{'Age of House at Sale': {0: 6,
1: 2016,
2: 92,
3: 42,
4: 90,
5: 2012,
6: 89,
7: 3,
8: 2015,
9: 104},
'AreaSource': {0: 2.0,
1: 7.0,
2: 2.0,
3: 2.0,
4: 2.0,
5: 2.0,
6: 2.0,
7: 2.0,
8: 2.0,
9: 2.0},
'AssessLand': {0: 9900.0,
1: 1571850.0,
2: 1548000.0,
3: 36532350.0,
4: 2250000.0,
5: 3110400.0,
6: 2448000.0,
7: 1354500.0,
8: 1699200.0,
9: 1282500.0},
'AssessTot': {0: 34380.0,
1: 1571850.0,
2: 25463250.0,
3: 149792400.0,
4: 27166050.0,
5: 5579990.0,
6: 28309500.0,
7: 23965650.0,
8: 3534300.0,
9: 11295000.0},
'BldgArea': {0: 2688.0,
1: 0.0,
2: 304650.0,
3: 2548000.0,
4: 356000.0,
5: 382746.0,
6: 290440.0,
7: 241764.0,
8: 463427.0,
9: 547000.0},
'BldgClass': {0: 72,
1: 89,
2: 80,
3: 157,
4: 150,
5: 44,
6: 92,
7: 43,
8: 39,
9: 61},
'BldgDepth': {0: 50.0,
1: 0.0,
2: 92.0,
3: 0.0,
4: 100.33,
5: 315.0,
6: 125.0,
7: 100.0,
8: 0.0,
9: 80.92},
'BldgFront': {0: 20.0,
1: 0.0,
2: 335.0,
3: 0.0,
4: 202.0,
5: 179.0,
6: 92.0,
7: 500.0,
8: 0.0,
9: 304.0},
'BsmtCode': {0: 5.0,
1: 5.0,
2: 5.0,
3: 5.0,
4: 2.0,
5: 5.0,
6: 2.0,
7: 2.0,
8: 5.0,
9: 5.0},
'CD': {0: 310.0,
1: 302.0,
2: 302.0,
3: 318.0,
4: 302.0,
5: 301.0,
6: 302.0,
7: 301.0,
8: 301.0,
9: 302.0},
'ComArea': {0: 0.0,
1: 0.0,
2: 304650.0,
3: 2548000.0,
4: 30000.0,
5: 11200.0,
6: 290440.0,
7: 27900.0,
8: 4884.0,
9: 547000.0},
'CommFAR': {0: 0.0,
1: 2.0,
2: 2.0,
3: 2.0,
4: 0.0,
5: 0.0,
6: 10.0,
7: 2.0,
8: 0.0,
9: 2.0},
'Council': {0: 41.0,
1: 33.0,
2: 33.0,
3: 46.0,
4: 33.0,
5: 33.0,
6: 33.0,
7: 33.0,
8: 33.0,
9: 35.0},
'Easements': {0: 0.0,
1: 0.0,
2: 0.0,
3: 1.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 0.0,
8: 0.0,
9: 0.0},
'ExemptLand': {0: 0.0,
1: 1571850.0,
2: 0.0,
3: 0.0,
4: 2250000.0,
5: 0.0,
6: 0.0,
7: 932847.0,
8: 0.0,
9: 0.0},
'ExemptTot': {0: 0.0,
1: 1571850.0,
2: 0.0,
3: 0.0,
4: 27166050.0,
5: 0.0,
6: 11304900.0,
7: 23543997.0,
8: 0.0,
9: 0.0},
'FacilFAR': {0: 0.0,
1: 6.5,
2: 0.0,
3: 0.0,
4: 4.8,
5: 4.8,
6: 10.0,
7: 3.0,
8: 5.0,
9: 4.8},
'FactryArea': {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 0.0,
8: 0.0,
9: 547000.0},
'GarageArea': {0: 0.0,
1: 0.0,
2: 0.0,
3: 1285000.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 22200.0,
8: 0.0,
9: 0.0},
'HealthArea': {0: 6410.0,
1: 1000.0,
2: 2300.0,
3: 8822.0,
4: 2300.0,
5: 400.0,
6: 2300.0,
7: 700.0,
8: 500.0,
9: 9300.0},
'HealthCent': {0: 35.0,
1: 36.0,
2: 38.0,
3: 35.0,
4: 38.0,
5: 30.0,
6: 38.0,
7: 30.0,
8: 30.0,
9: 36.0},
'IrrLotCode': {0: 1, 1: 1, 2: 0, 3: 0, 4: 1, 5: 1, 6: 0, 7: 1, 8: 0, 9: 0},
'LandUse': {0: 2.0,
1: 10.0,
2: 5.0,
3: 5.0,
4: 8.0,
5: 4.0,
6: 5.0,
7: 3.0,
8: 3.0,
9: 6.0},
'LotArea': {0: 2252.0,
1: 134988.0,
2: 32000.0,
3: 905000.0,
4: 20267.0,
5: 57600.0,
6: 12500.0,
7: 50173.0,
8: 44704.0,
9: 113800.0},
'LotDepth': {0: 100.0,
1: 275.33,
2: 335.92,
3: 859.0,
4: 100.33,
5: 320.0,
6: 125.0,
7: 200.0,
8: 281.86,
9: 204.0},
'LotFront': {0: 24.0,
1: 490.5,
2: 92.42,
3: 930.0,
4: 202.0,
5: 180.0,
6: 100.0,
7: 521.25,
8: 225.08,
9: 569.0},
'LotType': {0: 5.0,
1: 5.0,
2: 3.0,
3: 3.0,
4: 3.0,
5: 3.0,
6: 3.0,
7: 1.0,
8: 5.0,
9: 3.0},
'NumBldgs': {0: 1.0,
1: 0.0,
2: 1.0,
3: 4.0,
4: 1.0,
5: 1.0,
6: 1.0,
7: 1.0,
8: 2.0,
9: 13.0},
'NumFloors': {0: 2.0,
1: 0.0,
2: 13.0,
3: 2.0,
4: 15.0,
5: 0.0,
6: 37.0,
7: 6.0,
8: 20.0,
9: 8.0},
'OfficeArea': {0: 0.0,
1: 0.0,
2: 264750.0,
3: 0.0,
4: 30000.0,
5: 1822.0,
6: 274500.0,
7: 4200.0,
8: 0.0,
9: 0.0},
'OtherArea': {0: 0.0,
1: 0.0,
2: 39900.0,
3: 0.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 0.0,
8: 0.0,
9: 0.0},
'PolicePrct': {0: 70.0,
1: 84.0,
2: 84.0,
3: 63.0,
4: 84.0,
5: 90.0,
6: 84.0,
7: 94.0,
8: 90.0,
9: 88.0},
'ProxCode': {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 1.0,
8: 0.0,
9: 0.0},
'ResArea': {0: 2172.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 371546.0,
6: 0.0,
7: 213864.0,
8: 458543.0,
9: 0.0},
'ResidFAR': {0: 2.0,
1: 7.2,
2: 0.0,
3: 0.0,
4: 2.43,
5: 2.43,
6: 10.0,
7: 3.0,
8: 5.0,
9: 0.0},
'RetailArea': {0: 0.0,
1: 0.0,
2: 0.0,
3: 1263000.0,
4: 0.0,
5: 9378.0,
6: 15940.0,
7: 0.0,
8: 4884.0,
9: 0.0},
'SHAPE_Area': {0: 2316.8863224,
1: 140131.577176,
2: 34656.4472405,
3: 797554.847834,
4: 21360.1476315,
5: 58564.8643115,
6: 12947.145471,
7: 50772.624868800005,
8: 47019.5677861,
9: 118754.78573699998},
'SHAPE_Leng': {0: 249.41135038849998,
1: 1559.88914353,
2: 890.718521021,
3: 3729.78685686,
4: 620.761169374,
5: 1006.33799946,
6: 460.03168012300006,
7: 1385.27352839,
8: 992.915660585,
9: 1565.91477261},
'SanitDistr': {0: 10.0,
1: 2.0,
2: 2.0,
3: 18.0,
4: 2.0,
5: 1.0,
6: 2.0,
7: 1.0,
8: 1.0,
9: 2.0},
'SanitSub': {0: 21,
1: 23,
2: 31,
3: 22,
4: 31,
5: 21,
6: 23,
7: 7,
8: 12,
9: 22},
'SchoolDist': {0: 19.0,
1: 13.0,
2: 13.0,
3: 22.0,
4: 13.0,
5: 14.0,
6: 13.0,
7: 14.0,
8: 14.0,
9: 14.0},
'SplitZone': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 0, 9: 1},
'StrgeArea': {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 1500.0,
8: 0.0,
9: 0.0},
'UnitsRes': {0: 2.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 522.0,
6: 0.0,
7: 234.0,
8: 470.0,
9: 0.0},
'UnitsTotal': {0: 2.0,
1: 0.0,
2: 0.0,
3: 123.0,
4: 1.0,
5: 525.0,
6: 102.0,
7: 237.0,
8: 472.0,
9: 1.0},
'YearAlter1': {0: 0.0,
1: 0.0,
2: 1980.0,
3: 0.0,
4: 1998.0,
5: 0.0,
6: 2009.0,
7: 2012.0,
8: 0.0,
9: 0.0},
'YearAlter2': {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 2000.0,
5: 0.0,
6: 0.0,
7: 0.0,
8: 0.0,
9: 0.0},
'ZipCode': {0: 11220.0,
1: 11201.0,
2: 11201.0,
3: 11234.0,
4: 11201.0,
5: 11249.0,
6: 11241.0,
7: 11211.0,
8: 11249.0,
9: 11205.0},
'ZoneDist1': {0: 24,
1: 76,
2: 5,
3: 64,
4: 24,
5: 24,
6: 30,
7: 74,
8: 45,
9: 27},
'ZoneMap': {0: 3,
1: 19,
2: 19,
3: 22,
4: 19,
5: 19,
6: 19,
7: 2,
8: 19,
9: 19},
'building_class': {0: 141,
1: 97,
2: 87,
3: 176,
4: 168,
5: 8,
6: 102,
7: 46,
8: 97,
9: 66},
'building_class_at_sale': {0: 143,
1: 98,
2: 89,
3: 179,
4: 171,
5: 7,
6: 103,
7: 49,
8: 98,
9: 69},
'building_class_category': {0: 39,
1: 71,
2: 31,
3: 38,
4: 86,
5: 40,
6: 80,
7: 75,
8: 71,
9: 41},
'commercial_units': {0: 1,
1: 0,
2: 0,
3: 123,
4: 1,
5: 0,
6: 102,
7: 3,
8: 0,
9: 1},
'gross_sqft': {0: 0.0,
1: 0.0,
2: 304650.0,
3: 2548000.0,
4: 356000.0,
5: 0.0,
6: 290440.0,
7: 241764.0,
8: 0.0,
9: 547000.0},
'land_sqft': {0: 0.0,
1: 134988.0,
2: 32000.0,
3: 905000.0,
4: 20267.0,
5: 57600.0,
6: 12500.0,
7: 50173.0,
8: 44704.0,
9: 113800.0},
'neighborhood': {0: 43,
1: 48,
2: 6,
3: 44,
4: 6,
5: 40,
6: 6,
7: 28,
8: 40,
9: 56},
'residential_units': {0: 0,
1: 0,
2: 0,
3: 0,
4: 0,
5: 0,
6: 0,
7: 234,
8: 0,
9: 0},
'sale_date': {0: 2257,
1: 4839,
2: 337,
3: 638,
4: 27,
5: 1458,
6: 2450,
7: 3276,
8: 5082,
9: 1835},
'sale_price': {0: 499401179.0,
1: 345000000.0,
2: 340000000.0,
3: 276947000.0,
4: 202500000.0,
5: 185445000.0,
6: 171000000.0,
7: 169000000.0,
8: 165000000.0,
9: 161000000.0},
'tax_class': {0: 3, 1: 3, 2: 3, 3: 3, 4: 3, 5: 3, 6: 3, 7: 7, 8: 3, 9: 3},
'total_units': {0: 1,
1: 0,
2: 0,
3: 123,
4: 1,
5: 0,
6: 102,
7: 237,
8: 0,
9: 1},
'zip_code': {0: 11201,
1: 11201,
2: 11201,
3: 11234,
4: 11201,
5: 11249,
6: 11241,
7: 11211,
8: 11249,
9: 11205}}

pandas: columns must be same length as key

I'm trying to re-format several columns into strings (they contain NaNs, so I can't just read them in as integers). All of the columns are currently float64, and I want to make it so they don't have decimals.
Here is the data:
{'crash_id': {0: 201226857.0,
1: 201226857.0,
2: 2012272611.0,
3: 2012272611.0,
4: 2012298998.0},
'driver_action1': {0: 1.0, 1: 1.0, 2: 29.0, 3: 1.0, 4: 3.0},
'driver_action2': {0: 99.0, 1: 99.0, 2: 1.0, 3: 99.0, 4: 99.0},
'driver_action3': {0: 99.0, 1: 99.0, 2: 99.0, 3: 99.0, 4: 99.0},
'driver_action4': {0: 99.0, 1: 99.0, 2: 99.0, 3: 99.0, 4: 99.0},
'harmful_event1': {0: 14.0, 1: 14.0, 2: 14.0, 3: 14.0, 4: 14.0},
'harmful_event2': {0: 99.0, 1: 99.0, 2: 99.0, 3: 99.0, 4: 99.0},
'harmful_event3': {0: 99.0, 1: 99.0, 2: 99.0, 3: 99.0, 4: 99.0},
'harmful_event4': {0: 99.0, 1: 99.0, 2: 99.0, 3: 99.0, 4: 99.0},
'most_damaged_area': {0: 14.0, 1: 2.0, 2: 14.0, 3: 14.0, 4: 3.0},
'most_harmful_event': {0: 14.0, 1: 14.0, 2: 14.0, 3: 14.0, 4: 14.0},
'point_of_impact': {0: 15.0, 1: 1.0, 2: 14.0, 3: 14.0, 4: 1.0},
'vehicle_id': {0: 20121.0, 1: 20122.0, 2: 20123.0, 3: 20124.0, 4: 20125.0},
'vehicle_maneuver': {0: 3.0, 1: 1.0, 2: 4.0, 3: 1.0, 4: 1.0}}
When I try to convert those columns to string, this is what happens:
>> df[['crash_id','vehicle_id','point_of_impact','most_damaged_area','most_harmful_event','vehicle_maneuver','harmful_event1','harmful_event2','harmful_event3','harmful_event4','driver_action1','driver_action2','driver_action3','driver_action4']] = df[['crash_id','vehicle_id','point_of_impact','most_damaged_area','most_harmful_event','vehicle_maneuver','harmful_event1','harmful_event2','harmful_event3','harmful_event4','driver_action1','driver_action2','driver_action3','driver_action4']].applymap(lambda x: '{:.0f}'.format(x))
File "C:\Users\<name>\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2376, in _setitem_array
raise ValueError('Columns must be same length as key')
ValueError: Columns must be same length as key
I've never seen this error before and feel like this is something simple...what am I doing wrong?
Your code runs for me with the dictionary you provided. Try creating a function to deal with the NaN cases separately; I think they are causing your issues.
Something basic like below:
def formatter(x):
    if pd.isna(x):  # NaN is a float, so an `x == None` check would miss it
        return None
    return '{:.0f}'.format(x)
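A runnable sketch of such a NaN-safe formatter on a made-up two-row frame; pd.isna is used because NaN floats are not equal to None:

```python
import pandas as pd

# Two made-up rows; the None becomes NaN in the float64 column.
df = pd.DataFrame({'crash_id': [201226857.0, None],
                   'vehicle_id': [20121.0, 20122.0]})

def formatter(x):
    if pd.isna(x):                    # catches NaN, which `x == None` misses
        return None
    return '{:.0f}'.format(x)

out = df[['crash_id', 'vehicle_id']].applymap(formatter)
print(out['crash_id'].tolist())  # ['201226857', None]
```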
