Related
This question already has an answer here:
Fill nan with zero python pandas
(1 answer)
Closed 12 months ago.
I want to turn the nan values into zeroes and get the Expected Output.
import pandas as pd
data = pd.DataFrame({'Symbol': {4: 'DIS', 5: 'DKNG', 6: 'EXC'},
'Number of Buy s': {4: 1.0, 5: 2.0, 6: 1.0},
'Number of Cover s': {4: nan, 5: 2.0, 6: nan},
'Number of Sell s': {4: 1.0, 5: 1.0, 6: 1.0},
'Number of Short s': {4: nan, 5: 1.0, 6: nan},
'Gains/Losses': {4: -47.700000000000045, 5: -189.80000000000018, 6: 11.599999999999909},
'Percentage change': {4: -1.9691362018764154, 5: 1.380299604344981, 6: -2.006821924253117}})
Expected Output:
data = pd.DataFrame({'Symbol': {4: 'DIS', 5: 'DKNG', 6: 'EXC'},
'Number of Buy s': {4: 1.0, 5: 2.0, 6: 1.0},
'Number of Cover s': {4: 0, 5: 2.0, 6: 0},
'Number of Sell s': {4: 1.0, 5: 1.0, 6: 1.0},
'Number of Short s': {4: 0, 5: 1.0, 6: 0},
'Gains/Losses': {4: -47.700000000000045, 5: -189.80000000000018, 6: 11.599999999999909},
'Percentage change': {4: -1.9691362018764154, 5: 1.380299604344981, 6: -2.006821924253117}})
To replace all NaN values of a DataFrame with 0 -
df = df.fillna(0)
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4))
df.iloc[1, 1] = np.nan
df.iloc[1, 2] = np.nan
df.iloc[1, 3] = np.nan
df.iloc[2, 0] = np.nan
df = df.fillna(0)
print(df)
I have this dataset, I'm trying to have a mean of "AC_POWER" every hour but isn't working properly. The dataset have 20-22 value every 15 minutes. I want to have something like this:
DATE AC_POWER
'15-05-2020 00:00' 400
'15-05-2020 01:00' 500
'15-05-2020 02:00' 500
'15-05-2020 03:00' 500
How to solve this?
import pandas as pd
df = pd.read_csv('dataset.csv')
df = df.reset_index()
df['DATE_TIME'] = df['DATE_TIME'].astype('datetime64[ns]')
df = df.resample('H', on='DATE_TIME').mean()
>>> df.head(10).to_dict()
{'AC_POWER': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.0},
'DAILY_YIELD': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.0},
'DATE_TIME': {0: '15-05-2020 00:00', 1: '15-05-2020 00:00', 2: '15-05-2020 00:00', 3: '15-05-2020 00:00',
4: '15-05-2020 00:00', 5: '15-05-2020 00:00', 6: '15-05-2020 00:00', 7: '15-05-2020 00:00',
8: '15-05-2020 00:00', 9: '15-05-2020 00:00'},
'DC_POWER': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.0},
'PLANT_ID': {0: 4135001, 1: 4135001, 2: 4135001, 3: 4135001, 4: 4135001, 5: 4135001,
6: 4135001, 7: 4135001, 8: 4135001, 9: 4135001},
'SOURCE_KEY': {0: '1BY6WEcLGh8j5v7', 1: '1IF53ai7Xc0U56Y', 2: '3PZuoBAID5Wc2HD', 3: '7JYdWkrLSPkdwr4',
4: 'McdE0feGgRqW7Ca', 5: 'VHMLBKoKgIrUVDU', 6: 'WRmjgnKYAwPKWDb', 7: 'ZnxXDlPa8U1GXgE',
8: 'ZoEaEvLYb1n2sOq', 9: 'adLQvlD726eNBSB'},
'TOTAL_YIELD': {0: 6259559.0, 1: 6183645.0, 2: 6987759.0, 3: 7602960.0, 4: 7158964.0,
5: 7206408.0, 6: 7028673.0, 7: 6522172.0, 8: 7098099.0, 9: 6271355.0}}
EDIT: I tried with a different dataset and the same code I've posted and it worked!
You need to set your date as an index first, the following does this and computes the mean for windows of 15 minutes:
df.set_index('DATE_TIME').resample('15T').mean()
Also, make sure your date vector is correctly formated.
I think you're looking for DataFrame.resample:
df.resample(rule='H', on='DATE_TIME')['AC_POWER'].mean()
Trying to resolve the error:
application.py:25: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
application.py:26: SettingWithCopyWarning:
but can't figure out why i'm getting this error and how to resolve it.
This is my code:
hr = hr_data[['Month','SalesSystemCode','TITULO','BirthDate','HireDate','SupervisorEmployeeID','BASE','carallowance','Commission_Target','Area','Fulfilment %','Commission Accrued','Commission paid',
'Características (D)', 'Características (I)', 'Características (S)','Características (C)', 'Motivación (D)', 'Motivación (I)','Motivación (S)', 'Motivación (C)', 'Bajo Stress (D)',
'Bajo Stress (I)', 'Bajo Stress (S)', 'Bajo Stress (C)']]
sales = sales_data[['Report month', 'Area','Customer','Rental Charge','Cod. Motivo Desconexion','ID Vendedor']]
#report month to datetime
sales['Report month'] = pd.to_datetime(sales['Report month'])
hr['Month'] = pd.to_datetime(hr['Month'])
#remove sales where customer churned
sales_clean = sales.loc[sales['Cod. Motivo Desconexion'] == 0]
sales_clean = sales_clean[['Report month','Rental Charge','ID Vendedor']]
sales_clean2 = pd.DataFrame(sales_clean.groupby(['Report month','ID Vendedor'])['Rental Charge'].sum())
sales_clean2.reset_index(inplace=True)
hr_area = hr.loc[hr['Area'] == 'Area 1']
merged_hr = hr_area.merge(sales_clean, left_on=['SalesSystemCode','Month'],right_on=['ID Vendedor','Report month'],how='left')
#creating new features: months of employment
merged_hr['MonthsofEmploymentRounded'] = round((merged_hr['Month'] - merged_hr['HireDate'])/np.timedelta64(1,'M')).astype('int')
#filters for interaction
YEAR_MONTH = merged_hr['Month'].unique()
#css stylesheet
external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']
app = dash.Dash(__name__, external_stylesheets=external_stylesheets)
#html layout
app.layout = html.Div(children=[
html.H1(children='SAC Challenge Level 2 Dashboard', style ={
'textAlign': 'center',
'height':'10'
}),
html.Div(children='''
Objective: Studying the impact of supervision on the performance of sales executives in Area 1
'''),
dcc.DatePickerRange(
id='year_month',
start_date= min(merged_hr['Month'].dt.date.tolist()),
end_date = 'Select date'
),
dcc.Graph(
id='performancetable'
)
])
#app.callback(dash.dependencies.Output('performancetable','figure'),
[dash.dependencies.Input('year_month', 'start_date'),
dash.dependencies.Input('year_month','end_date')])
def update_table(year_month):
if year_month is None or year_month ==[]:
year_month = YEAR_MONTH
performance = merged_hr[(merged_hr['Month'].isin(year_month))]
return {
'data': [
go.Table(
header = dict(values=list(performance.columns),fill_color='paleturquoise',align='left'),
cells = dict(values=[performance['Month'],performance['SalesSystemCode'],performance['TITULO'],
performance['HireDate'],performance['MonthsofEmploymentRounded'],performance['SupervisorEmployeeID'],
performance['BASE'],performance['carallowance'],performance['Commission_Target'],
performance['Fulfilment %'], performance['Commission Accrued'],performance['Commission paid'],
performance['Características (D)'],performance['Características (I)'],performance['Características (S)'],
performance['Características (C)'],performance['Motivación (D)'],performance['Motivación (I)'],
performance['Motivación (S)'],performance['Motivación (C)'],performance['Bajo Stress (D)'],
performance['Bajo Stress (I)'],performance['Bajo Stress (S)'],performance['Bajo Stress (C)'],
performance['Rental Charge']])
)],
}
if __name__ == '__main__':
app.run_server(debug=True)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Here is a sample of hr_data:
{'Month': {0: Timestamp('2017-12-01 00:00:00'),
1: Timestamp('2017-12-01 00:00:00'),
2: Timestamp('2017-12-01 00:00:00'),
3: Timestamp('2017-12-01 00:00:00'),
4: Timestamp('2017-12-01 00:00:00')},
'EmployeeID': {0: 91868, 1: 1812496, 2: 1812430, 3: 700915, 4: 1812581},
'PayrollProviderName': {0: 'Tele',
1: 'People',
2: 'People',
3: 'Stratego',
4: 'People'},
'SalesSystemCode': {0: 91868.0,
1: 802496.0,
2: 2430.0,
3: 700915.0,
4: 802581.0},
'Payroll Type': {0: 'Insourcing',
1: 'Third Party',
2: 'Third Party',
3: 'Third Party',
4: 'Third Party'},
'Name': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'TITULO': {0: 'SALES SUPERVISOR',
1: 'SALES EXECUTIVE',
2: 'SALES EXECUTIVE',
3: 'SALES EXECUTIVE',
4: 'SALES EXECUTIVE'},
'Sexo': {0: 'M', 1: 'F', 2: 'F', 3: 'M', 4: 'F'},
'BirthDate': {0: Timestamp('1982-11-05 00:00:00'),
1: Timestamp('1987-09-24 00:00:00'),
2: Timestamp('1981-01-13 00:00:00'),
3: Timestamp('1986-04-18 00:00:00'),
4: Timestamp('1991-06-24 00:00:00')},
'HireDate': {0: Timestamp('2012-04-23 00:00:00'),
1: Timestamp('2017-04-10 00:00:00'),
2: Timestamp('2017-03-13 00:00:00'),
3: Timestamp('2015-01-22 00:00:00'),
4: Timestamp('2017-05-18 00:00:00')},
'SupervisorEmployeeID': {0: 7935, 1: 91868, 2: 91868, 3: 91868, 4: 91868},
'SupervisorName': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'BASE': {0: 895, 1: 700, 2: 700, 3: 700, 4: 700},
'carallowance': {0: 350, 1: 250, 2: 250, 3: 250, 4: 250},
'Commission_Target': {0: 708.33, 1: 583.33, 2: 583.33, 3: 583.33, 4: 583.33},
'Nacionalidad': {0: 'INT', 1: 'INT', 2: 'INT', 3: 'INT', 4: 'INT'},
'Area': {0: 'Area 1', 1: 'Area 1', 2: 'Area 1', 3: 'Area 1', 4: 'Area 1'},
'Comment': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Sales Quota (points)': {0: 1810.0, 1: 108.0, 2: 108.0, 3: 108.0, 4: 108.0},
'Real (points)': {0: 1855.0, 1: 86.0, 2: 245.0, 3: 149.0, 4: 91.0},
'Fulfilment %': {0: 1.0248618784530388,
1: 0.7962962962962963,
2: 2.2685185185185186,
3: 1.3796296296296295,
4: 0.8425925925925926},
'Commission Accrued': {0: 708.33, 1: 583.33, 2: 583.33, 3: 583.33, 4: 583.33},
'OA Commission Accrued': {0: 653.66,
1: 87.5,
2: 1494.79,
3: 794.79,
4: 160.42},
'Clawback': {0: 0.0, 1: 24.33, 2: 144.9, 3: 36.77, 4: 0.0},
'Other Commissions': {0: 0.0, 1: 0.0, 2: 9.16, 3: 9.16, 4: 0.0},
'Commission paid': {0: 1361.99, 1: 646.51, 2: 1942.38, 3: 1350.52, 4: 743.75},
'Exit Date': {0: NaT,
1: Timestamp('2018-04-13 00:00:00'),
2: NaT,
3: NaT,
4: Timestamp('2018-08-31 00:00:00')},
'Legal Motive': {0: nan,
1: 'Artículo No. 212',
2: nan,
3: nan,
4: 'Artículo No. 212'},
'Características (D)': {0: nan, 1: 70.0, 2: 70.0, 3: 60.0, 4: 67.0},
'Características (I)': {0: nan, 1: 95.0, 2: 62.0, 3: 25.0, 4: 15.0},
'Características (S)': {0: nan, 1: 20.0, 2: 48.0, 3: 75.0, 4: 40.0},
'Características (C)': {0: nan, 1: 25.0, 2: 34.0, 3: 85.0, 4: 94.0},
'Motivación (D)': {0: nan, 1: 85.0, 2: 75.0, 3: 40.0, 4: 59.0},
'Motivación (I)': {0: nan, 1: 95.0, 2: 74.0, 3: 74.0, 4: 25.0},
'Motivación (S)': {0: nan, 1: 11.0, 2: 58.0, 3: 65.0, 4: 65.0},
'Motivación (C)': {0: nan, 1: 7.0, 2: 33.0, 3: 84.0, 4: 93.0},
'Bajo Stress (D)': {0: nan, 1: 60.0, 2: 69.0, 3: 79.0, 4: 79.0},
'Bajo Stress (I)': {0: nan, 1: 86.0, 2: 60.0, 3: 6.0, 4: 18.0},
'Bajo Stress (S)': {0: nan, 1: 40.0, 2: 60.0, 3: 89.0, 4: 30.0},
'Bajo Stress (C)': {0: nan, 1: 60.0, 2: 48.0, 3: 84.0, 4: 92.0}}
sales_data:
{'Month': {0: Timestamp('2017-07-01 00:00:00'),
1: Timestamp('2017-07-01 00:00:00'),
2: Timestamp('2017-07-01 00:00:00'),
3: Timestamp('2017-07-01 00:00:00'),
4: Timestamp('2017-07-01 00:00:00')},
'Report month': {0: '2017-07',
1: '2017-07',
2: '2017-07',
3: '2017-07',
4: '2017-07'},
'Area': {0: 'Area 1', 1: 'Area 1', 2: 'Area 1', 3: 'Area 1', 4: 'Area 1'},
'Fecha de solicitud': {0: Timestamp('2017-07-25 14:49:51'),
1: Timestamp('2017-07-25 14:56:14'),
2: Timestamp('2017-06-30 13:07:10'),
3: Timestamp('2017-07-03 18:25:17'),
4: Timestamp('2017-07-04 09:56:24')},
'Fecha de salida': {0: Timestamp('2017-07-27 13:11:42'),
1: Timestamp('2017-07-27 15:08:39'),
2: Timestamp('2017-07-04 11:50:07'),
3: Timestamp('2017-07-07 16:40:44'),
4: Timestamp('2017-07-14 14:52:45')},
'Fecha de salida final': {0: Timestamp('2017-07-28 15:13:53'),
1: Timestamp('2017-07-27 15:46:16'),
2: Timestamp('2017-07-05 10:24:46'),
3: Timestamp('2017-07-08 08:36:43'),
4: Timestamp('2017-07-15 10:00:02')},
'Fecha de proceso': {0: Timestamp('2017-08-01 00:00:00'),
1: Timestamp('2017-08-01 00:00:00'),
2: Timestamp('2017-08-01 00:00:00'),
3: Timestamp('2017-08-01 00:00:00'),
4: Timestamp('2017-08-01 00:00:00')},
'Fecha de sistema': {0: Timestamp('2017-07-25 14:49:51'),
1: Timestamp('2017-07-25 14:56:14'),
2: Timestamp('2017-06-30 13:07:10'),
3: Timestamp('2017-07-03 18:25:17'),
4: Timestamp('2017-07-04 09:56:24')},
'Fecha de completada': {0: Timestamp('2017-07-28 15:13:52'),
1: Timestamp('2017-07-27 15:46:15'),
2: Timestamp('2017-07-05 10:24:45'),
3: Timestamp('2017-07-08 08:36:42'),
4: Timestamp('2017-07-15 10:00:02')},
'Fecha de creada': {0: Timestamp('2017-07-25 14:50:00'),
1: Timestamp('2017-07-25 14:56:00'),
2: Timestamp('2017-06-30 13:07:00'),
3: Timestamp('2017-07-03 18:25:00'),
4: Timestamp('2017-07-04 09:56:00')},
'Cod. de Distribucion': {0: 2302, 1: 2302, 2: 2302, 3: 91818, 4: 2302},
'Customer': {0: 19308378, 1: 19308378, 2: 27504455, 3: 27104497, 4: 17608676},
'Cod. Tipo Cliente': {0: 'R', 1: 'R', 2: 'R', 3: 'R', 4: 'R'},
'Tipo De Cliente': {0: 'Residencial ',
1: 'Residencial ',
2: 'Residencial ',
3: 'Residencial ',
4: 'Residencial '},
'Cuenta': {0: 193083780000,
1: 193083780000,
2: 275044550000,
3: 271044970000,
4: 176086760000},
'Status Cuenta': {0: 'W', 1: 'W', 2: 'W', 3: 'W', 4: 'W'},
'Tipo de Contabilidad': {0: 'RP', 1: 'RP', 2: 'RP', 3: 'RP', 4: 'RP'},
'Desc. Tipo Contabilidad': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Tos Cat': {0: 'K', 1: 'K', 2: 'K', 3: 'K', 4: 'K'},
'Desc. Tos Cat': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Mktg Cat': {0: 990005.0, 1: 990005.0, 2: 990000.0, 3: 990000.0, 4: 990000.0},
'Desc. Mktg Cat': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Cod. Bill Sort': {0: 571.0, 1: 571.0, 2: 571.0, 3: 691.0, 4: 256.0},
'Orden de Servicio': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Comando': {0: 'PMO', 1: 'PFB', 2: 'PMO', 3: 'PMO', 4: 'PMO'},
'Desc. Comando': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Prioridad': {0: 5, 1: 5, 2: 5, 3: 5, 4: 5},
'Cod. Línea': {0: 3, 1: 2, 2: 1, 3: 1, 4: 1},
'Número de Servicio': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Producto': {0: 1420, 1: 31000, 2: 1403, 3: 1404, 4: 1404},
'Desc. Producto': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Familia': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Sub Familia': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Rental Charge': {0: 22.5,
1: 18.7125,
2: 15.257499999999999,
3: 19.95,
4: 19.95},
'Inst Charge': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0},
'Control': {0: 'CONEXIONES_COMPLETADAS_CT',
1: 'CONEXIONES_COMPLETADAS_CT',
2: 'CONEXIONES_COMPLETADAS',
3: 'CONEXIONES_COMPLETADAS',
4: 'CONEXIONES_COMPLETADAS'},
'Cod. Estatus': {0: 'A', 1: 'A', 2: 'A', 3: 'A', 4: 'A'},
'Status': {0: 'Por Acción ',
1: 'Por Acción ',
2: 'Por Acción ',
3: 'Por Acción ',
4: 'Por Acción '},
'Cod Razon Pendiente': {0: ' ', 1: ' ', 2: ' ', 3: ' ', 4: ' '},
'Razon Pendiente': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Cod. Motivo Desconexion': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
'Motivo Desconexion': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Cod. Agencia': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Agencia': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'ID Vendedor': {0: 2352.0, 1: 2352.0, 2: 2352.0, 3: 2352.0, 4: 2352.0},
'ID Oficinista': {0: 229113.0,
1: 229113.0,
2: 224666.0,
3: 221532.0,
4: 224666.0},
'ID Acct Manager': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0},
'Desc. Acct Manager': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Provincia': {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B'},
'Central': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Chrg Prod Ant': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Tipo Srv': {0: 'MO', 1: 'TI', 2: 'MO', 3: 'MO', 4: 'MO'},
'Tipo Srv Desc': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Diferencia ': {0: 2.5500000000000007,
1: 0.0,
2: 15.257499999999999,
3: 19.95,
4: 19.95},
'Puntos ': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan}}
#QuanHoang was pointing in the right direction with his comment, but you need to add .copy() for both the hr and sales dataframes:
hr = hr_data[['Month','SalesSystemCode','TITULO','BirthDate','HireDate','SupervisorEmployeeID','BASE','carallowance','Commission_Target','Area','Fulfilment %','Commission Accrued','Commission paid',
'Características (D)', 'Características (I)', 'Características (S)','Características (C)', 'Motivación (D)', 'Motivación (I)','Motivación (S)', 'Motivación (C)', 'Bajo Stress (D)',
'Bajo Stress (I)', 'Bajo Stress (S)', 'Bajo Stress (C)']].copy()
sales = sales_data[['Report month', 'Area','Customer','Rental Charge','Cod. Motivo Desconexion','ID Vendedor']].copy()
Using .copy() works because it creates a full copy of the data, rather than a view. Subsequent indexing operations work correctly on the copy.
Another option is to use .loc[] indexing when you do the selection from hr_data and sales_data. This should also work:
hr = hr_data.loc[:, ['Month','SalesSystemCode','TITULO','BirthDate','HireDate','SupervisorEmployeeID','BASE','carallowance','Commission_Target','Area','Fulfilment %','Commission Accrued','Commission paid',
'Características (D)', 'Características (I)', 'Características (S)','Características (C)', 'Motivación (D)', 'Motivación (I)','Motivación (S)', 'Motivación (C)', 'Bajo Stress (D)',
'Bajo Stress (I)', 'Bajo Stress (S)', 'Bajo Stress (C)']]
sales = sales_data.loc[:, ['Report month', 'Area','Customer','Rental Charge','Cod. Motivo Desconexion','ID Vendedor']]
Note that selecting columns with .loc[] uses the format df.loc[:, [ *columns* ] becasue .loc[] requires specifying the rows explicitly.
Using .loc[] works because .loc[] (and .iloc[]) indexing return a reference to the original dataframe, but with updated indexing behavior which is not subject to the 'setting with copy' problems.
Trying to run a decision tree regressor on my data, but whenever I try and run my code, I get this error
ValueError: Number of labels=78177 does not match number of samples=312706
#feature selection
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
target = ['sale_price']
train, test = train_test_split(housing_data, test_size=0.2)
regression_tree = DecisionTreeRegressor(criterion="entropy",random_state=100,
max_depth=4,min_samples_leaf=5)
regression_tree.fit(train,test)
I have added a sample of my code, hopefully this gives you more context to help better understand my question and problem:
{'Age of House at Sale': {0: 6,
1: 2016,
2: 92,
3: 42,
4: 90,
5: 2012,
6: 89,
7: 3,
8: 2015,
9: 104},
'AreaSource': {0: 2.0,
1: 7.0,
2: 2.0,
3: 2.0,
4: 2.0,
5: 2.0,
6: 2.0,
7: 2.0,
8: 2.0,
9: 2.0},
'AssessLand': {0: 9900.0,
1: 1571850.0,
2: 1548000.0,
3: 36532350.0,
4: 2250000.0,
5: 3110400.0,
6: 2448000.0,
7: 1354500.0,
8: 1699200.0,
9: 1282500.0},
'AssessTot': {0: 34380.0,
1: 1571850.0,
2: 25463250.0,
3: 149792400.0,
4: 27166050.0,
5: 5579990.0,
6: 28309500.0,
7: 23965650.0,
8: 3534300.0,
9: 11295000.0},
'BldgArea': {0: 2688.0,
1: 0.0,
2: 304650.0,
3: 2548000.0,
4: 356000.0,
5: 382746.0,
6: 290440.0,
7: 241764.0,
8: 463427.0,
9: 547000.0},
'BldgClass': {0: 72,
1: 89,
2: 80,
3: 157,
4: 150,
5: 44,
6: 92,
7: 43,
8: 39,
9: 61},
'BldgDepth': {0: 50.0,
1: 0.0,
2: 92.0,
3: 0.0,
4: 100.33,
5: 315.0,
6: 125.0,
7: 100.0,
8: 0.0,
9: 80.92},
'BldgFront': {0: 20.0,
1: 0.0,
2: 335.0,
3: 0.0,
4: 202.0,
5: 179.0,
6: 92.0,
7: 500.0,
8: 0.0,
9: 304.0},
'BsmtCode': {0: 5.0,
1: 5.0,
2: 5.0,
3: 5.0,
4: 2.0,
5: 5.0,
6: 2.0,
7: 2.0,
8: 5.0,
9: 5.0},
'CD': {0: 310.0,
1: 302.0,
2: 302.0,
3: 318.0,
4: 302.0,
5: 301.0,
6: 302.0,
7: 301.0,
8: 301.0,
9: 302.0},
'ComArea': {0: 0.0,
1: 0.0,
2: 304650.0,
3: 2548000.0,
4: 30000.0,
5: 11200.0,
6: 290440.0,
7: 27900.0,
8: 4884.0,
9: 547000.0},
'CommFAR': {0: 0.0,
1: 2.0,
2: 2.0,
3: 2.0,
4: 0.0,
5: 0.0,
6: 10.0,
7: 2.0,
8: 0.0,
9: 2.0},
'Council': {0: 41.0,
1: 33.0,
2: 33.0,
3: 46.0,
4: 33.0,
5: 33.0,
6: 33.0,
7: 33.0,
8: 33.0,
9: 35.0},
'Easements': {0: 0.0,
1: 0.0,
2: 0.0,
3: 1.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 0.0,
8: 0.0,
9: 0.0},
'ExemptLand': {0: 0.0,
1: 1571850.0,
2: 0.0,
3: 0.0,
4: 2250000.0,
5: 0.0,
6: 0.0,
7: 932847.0,
8: 0.0,
9: 0.0},
'ExemptTot': {0: 0.0,
1: 1571850.0,
2: 0.0,
3: 0.0,
4: 27166050.0,
5: 0.0,
6: 11304900.0,
7: 23543997.0,
8: 0.0,
9: 0.0},
'FacilFAR': {0: 0.0,
1: 6.5,
2: 0.0,
3: 0.0,
4: 4.8,
5: 4.8,
6: 10.0,
7: 3.0,
8: 5.0,
9: 4.8},
'FactryArea': {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 0.0,
8: 0.0,
9: 547000.0},
'GarageArea': {0: 0.0,
1: 0.0,
2: 0.0,
3: 1285000.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 22200.0,
8: 0.0,
9: 0.0},
'HealthArea': {0: 6410.0,
1: 1000.0,
2: 2300.0,
3: 8822.0,
4: 2300.0,
5: 400.0,
6: 2300.0,
7: 700.0,
8: 500.0,
9: 9300.0},
'HealthCent': {0: 35.0,
1: 36.0,
2: 38.0,
3: 35.0,
4: 38.0,
5: 30.0,
6: 38.0,
7: 30.0,
8: 30.0,
9: 36.0},
'IrrLotCode': {0: 1, 1: 1, 2: 0, 3: 0, 4: 1, 5: 1, 6: 0, 7: 1, 8: 0, 9: 0},
'LandUse': {0: 2.0,
1: 10.0,
2: 5.0,
3: 5.0,
4: 8.0,
5: 4.0,
6: 5.0,
7: 3.0,
8: 3.0,
9: 6.0},
'LotArea': {0: 2252.0,
1: 134988.0,
2: 32000.0,
3: 905000.0,
4: 20267.0,
5: 57600.0,
6: 12500.0,
7: 50173.0,
8: 44704.0,
9: 113800.0},
'LotDepth': {0: 100.0,
1: 275.33,
2: 335.92,
3: 859.0,
4: 100.33,
5: 320.0,
6: 125.0,
7: 200.0,
8: 281.86,
9: 204.0},
'LotFront': {0: 24.0,
1: 490.5,
2: 92.42,
3: 930.0,
4: 202.0,
5: 180.0,
6: 100.0,
7: 521.25,
8: 225.08,
9: 569.0},
'LotType': {0: 5.0,
1: 5.0,
2: 3.0,
3: 3.0,
4: 3.0,
5: 3.0,
6: 3.0,
7: 1.0,
8: 5.0,
9: 3.0},
'NumBldgs': {0: 1.0,
1: 0.0,
2: 1.0,
3: 4.0,
4: 1.0,
5: 1.0,
6: 1.0,
7: 1.0,
8: 2.0,
9: 13.0},
'NumFloors': {0: 2.0,
1: 0.0,
2: 13.0,
3: 2.0,
4: 15.0,
5: 0.0,
6: 37.0,
7: 6.0,
8: 20.0,
9: 8.0},
'OfficeArea': {0: 0.0,
1: 0.0,
2: 264750.0,
3: 0.0,
4: 30000.0,
5: 1822.0,
6: 274500.0,
7: 4200.0,
8: 0.0,
9: 0.0},
'OtherArea': {0: 0.0,
1: 0.0,
2: 39900.0,
3: 0.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 0.0,
8: 0.0,
9: 0.0},
'PolicePrct': {0: 70.0,
1: 84.0,
2: 84.0,
3: 63.0,
4: 84.0,
5: 90.0,
6: 84.0,
7: 94.0,
8: 90.0,
9: 88.0},
'ProxCode': {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 1.0,
8: 0.0,
9: 0.0},
'ResArea': {0: 2172.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 371546.0,
6: 0.0,
7: 213864.0,
8: 458543.0,
9: 0.0},
'ResidFAR': {0: 2.0,
1: 7.2,
2: 0.0,
3: 0.0,
4: 2.43,
5: 2.43,
6: 10.0,
7: 3.0,
8: 5.0,
9: 0.0},
'RetailArea': {0: 0.0,
1: 0.0,
2: 0.0,
3: 1263000.0,
4: 0.0,
5: 9378.0,
6: 15940.0,
7: 0.0,
8: 4884.0,
9: 0.0},
'SHAPE_Area': {0: 2316.8863224,
1: 140131.577176,
2: 34656.4472405,
3: 797554.847834,
4: 21360.1476315,
5: 58564.8643115,
6: 12947.145471,
7: 50772.624868800005,
8: 47019.5677861,
9: 118754.78573699998},
'SHAPE_Leng': {0: 249.41135038849998,
1: 1559.88914353,
2: 890.718521021,
3: 3729.78685686,
4: 620.761169374,
5: 1006.33799946,
6: 460.03168012300006,
7: 1385.27352839,
8: 992.915660585,
9: 1565.91477261},
'SanitDistr': {0: 10.0,
1: 2.0,
2: 2.0,
3: 18.0,
4: 2.0,
5: 1.0,
6: 2.0,
7: 1.0,
8: 1.0,
9: 2.0},
'SanitSub': {0: 21,
1: 23,
2: 31,
3: 22,
4: 31,
5: 21,
6: 23,
7: 7,
8: 12,
9: 22},
'SchoolDist': {0: 19.0,
1: 13.0,
2: 13.0,
3: 22.0,
4: 13.0,
5: 14.0,
6: 13.0,
7: 14.0,
8: 14.0,
9: 14.0},
'SplitZone': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 0, 9: 1},
'StrgeArea': {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 1500.0,
8: 0.0,
9: 0.0},
'UnitsRes': {0: 2.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 522.0,
6: 0.0,
7: 234.0,
8: 470.0,
9: 0.0},
'UnitsTotal': {0: 2.0,
1: 0.0,
2: 0.0,
3: 123.0,
4: 1.0,
5: 525.0,
6: 102.0,
7: 237.0,
8: 472.0,
9: 1.0},
'YearAlter1': {0: 0.0,
1: 0.0,
2: 1980.0,
3: 0.0,
4: 1998.0,
5: 0.0,
6: 2009.0,
7: 2012.0,
8: 0.0,
9: 0.0},
'YearAlter2': {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 2000.0,
5: 0.0,
6: 0.0,
7: 0.0,
8: 0.0,
9: 0.0},
'ZipCode': {0: 11220.0,
1: 11201.0,
2: 11201.0,
3: 11234.0,
4: 11201.0,
5: 11249.0,
6: 11241.0,
7: 11211.0,
8: 11249.0,
9: 11205.0},
'ZoneDist1': {0: 24,
1: 76,
2: 5,
3: 64,
4: 24,
5: 24,
6: 30,
7: 74,
8: 45,
9: 27},
'ZoneMap': {0: 3,
1: 19,
2: 19,
3: 22,
4: 19,
5: 19,
6: 19,
7: 2,
8: 19,
9: 19},
'building_class': {0: 141,
1: 97,
2: 87,
3: 176,
4: 168,
5: 8,
6: 102,
7: 46,
8: 97,
9: 66},
'building_class_at_sale': {0: 143,
1: 98,
2: 89,
3: 179,
4: 171,
5: 7,
6: 103,
7: 49,
8: 98,
9: 69},
'building_class_category': {0: 39,
1: 71,
2: 31,
3: 38,
4: 86,
5: 40,
6: 80,
7: 75,
8: 71,
9: 41},
'commercial_units': {0: 1,
1: 0,
2: 0,
3: 123,
4: 1,
5: 0,
6: 102,
7: 3,
8: 0,
9: 1},
'gross_sqft': {0: 0.0,
1: 0.0,
2: 304650.0,
3: 2548000.0,
4: 356000.0,
5: 0.0,
6: 290440.0,
7: 241764.0,
8: 0.0,
9: 547000.0},
'land_sqft': {0: 0.0,
1: 134988.0,
2: 32000.0,
3: 905000.0,
4: 20267.0,
5: 57600.0,
6: 12500.0,
7: 50173.0,
8: 44704.0,
9: 113800.0},
'neighborhood': {0: 43,
1: 48,
2: 6,
3: 44,
4: 6,
5: 40,
6: 6,
7: 28,
8: 40,
9: 56},
'residential_units': {0: 0,
1: 0,
2: 0,
3: 0,
4: 0,
5: 0,
6: 0,
7: 234,
8: 0,
9: 0},
'sale_date': {0: 2257,
1: 4839,
2: 337,
3: 638,
4: 27,
5: 1458,
6: 2450,
7: 3276,
8: 5082,
9: 1835},
'sale_price': {0: 499401179.0,
1: 345000000.0,
2: 340000000.0,
3: 276947000.0,
4: 202500000.0,
5: 185445000.0,
6: 171000000.0,
7: 169000000.0,
8: 165000000.0,
9: 161000000.0},
'tax_class': {0: 3, 1: 3, 2: 3, 3: 3, 4: 3, 5: 3, 6: 3, 7: 7, 8: 3, 9: 3},
'total_units': {0: 1,
1: 0,
2: 0,
3: 123,
4: 1,
5: 0,
6: 102,
7: 237,
8: 0,
9: 1},
'zip_code': {0: 11201,
1: 11201,
2: 11201,
3: 11234,
4: 11201,
5: 11249,
6: 11241,
7: 11211,
8: 11249,
9: 11205}}
I'm trying to re-format several columns into strings (they contain NaNs, so I can't just read them in as integers). All of the columns are currently float64, and I want to make it so they don't have decimals.
Here is the data:
{'crash_id': {0: 201226857.0,
1: 201226857.0,
2: 2012272611.0,
3: 2012272611.0,
4: 2012298998.0},
'driver_action1': {0: 1.0, 1: 1.0, 2: 29.0, 3: 1.0, 4: 3.0},
'driver_action2': {0: 99.0, 1: 99.0, 2: 1.0, 3: 99.0, 4: 99.0},
'driver_action3': {0: 99.0, 1: 99.0, 2: 99.0, 3: 99.0, 4: 99.0},
'driver_action4': {0: 99.0, 1: 99.0, 2: 99.0, 3: 99.0, 4: 99.0},
'harmful_event1': {0: 14.0, 1: 14.0, 2: 14.0, 3: 14.0, 4: 14.0},
'harmful_event2': {0: 99.0, 1: 99.0, 2: 99.0, 3: 99.0, 4: 99.0},
'harmful_event3': {0: 99.0, 1: 99.0, 2: 99.0, 3: 99.0, 4: 99.0},
'harmful_event4': {0: 99.0, 1: 99.0, 2: 99.0, 3: 99.0, 4: 99.0},
'most_damaged_area': {0: 14.0, 1: 2.0, 2: 14.0, 3: 14.0, 4: 3.0},
'most_harmful_event': {0: 14.0, 1: 14.0, 2: 14.0, 3: 14.0, 4: 14.0},
'point_of_impact': {0: 15.0, 1: 1.0, 2: 14.0, 3: 14.0, 4: 1.0},
'vehicle_id': {0: 20121.0, 1: 20122.0, 2: 20123.0, 3: 20124.0, 4: 20125.0},
'vehicle_maneuver': {0: 3.0, 1: 1.0, 2: 4.0, 3: 1.0, 4: 1.0}}
When I try to convert those columns to string, this is what happens:
>> df[['crash_id','vehicle_id','point_of_impact','most_damaged_area','most_harmful_event','vehicle_maneuver','harmful_event1','harmful_event2','harmful_event3','harmful_event4','driver_action1','driver_action2','driver_action3','driver_action4']] = df[['crash_id','vehicle_id','point_of_impact','most_damaged_area','most_harmful_event','vehicle_maneuver','harmful_event1','harmful_event2','harmful_event3','harmful_event4','driver_action1','driver_action2','driver_action3','driver_action4']].applymap(lambda x: '{:.0f}'.format(x))
File "C:\Users\<name>\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2376, in _setitem_array
raise ValueError('Columns must be same length as key')
ValueError: Columns must be same length as key
I've never seen this error before and feel like this is something simple...what am I doing wrong?
Your code runs for me with the dictionary you provided. Try creating a function to deal with the NaN cases separately; I think they are causing your issues.
Something basic like below:
def formatter(x):
if x == None:
return None
else:
return '{:.0f}'.format(x)