so I'm trying to apply different conditions that depends on a date, months to be specific. For example, for January replace the data in TEMP that is above 45 but for February that is above 30 and so on. I already did that with the code below, but the problem is that the data from the previous month is replace it with nan.
This is my code:
meses = ["01", "02"]
for i in var_vars:
if i in dataframes2.columns.values:
for j in range(len(meses)):
test_prueba_mes = dataframes2[i].loc[dataframes2['fecha'].dt.month == int(meses[j])]
test_prueba = test_prueba_mes[dataframes2[i]<dataframes.loc[i]["X"+meses[j]+".max"]]
dataframes2["Prueba " + str(i)] = test_prueba
Output:
dataframes2.tail(5)
fecha TEMP_C_Avg RH_Avg Prueba TEMP_C_Avg Prueba RH_Avg
21 2020-01-01 22:00:00 46.0 103 NaN NaN
22 2020-01-01 23:00:00 29.0 103 NaN NaN
23 2020-01-02 00:00:00 31.0 3 NaN NaN
24 2020-01-02 12:00:00 31.0 2 NaN NaN
25 2020-02-01 10:00:00 29.0 5 29.0 5.0
My desired Output is:
Output:
fecha TEMP_C_Avg RH_Avg Prueba TEMP_C_Avg Prueba RH_Avg
21 2020-01-01 22:00:00 46.0 103 NaN NaN
22 2020-01-01 23:00:00 29.0 103 29.0 NaN
23 2020-01-02 00:00:00 31.0 3 31.0 3.0
24 2020-01-02 12:00:00 31.0 2 31.0 2.0
25 2020-02-01 10:00:00 29.0 5 29.0 5.0
Appreciate if anyone can help me.
Update: The ruleset for 6 months is jan 45, feb 30, mar 45, abr 10, may 15, jun 30
An example of the data:
fecha TEMP_C_Avg RH_Avg
25 2020-02-01 10:00:00 29.0 5
26 2020-02-01 11:00:00 32.0 105
27 2020-03-01 10:00:00 55.0 3
28 2020-03-01 11:00:00 40.0 5
29 2020-04-01 10:00:00 10.0 20
30 2020-04-01 11:00:00 5.0 15
31 2020-05-01 10:00:00 20.0 15
32 2020-05-01 11:00:00 5.0 106
33 2020-06-01 10:00:00 33.0 107
34 2020-06-01 11:00:00 20.0 20
With clear understanding
have encoded monthly limits into a dict limits
use numpy select(), when a condition matches take value corresponding to condition from second parameter. Default to third parameter
build conditions dynamically from limits dict
second parameter needs to be same length as conditions list. Build list of np.nan as list comprehension so it's correct length
to consider all columns, use a dict comprehension that builds **kwarg params to assign()
df = pd.read_csv(io.StringIO(""" fecha TEMP_C_Avg RH_Avg
25 2020-02-01 10:00:00 29.0 5
26 2020-02-01 11:00:00 32.0 105
27 2020-03-01 10:00:00 55.0 3
28 2020-03-01 11:00:00 40.0 5
29 2020-04-01 10:00:00 10.0 20
30 2020-04-01 11:00:00 5.0 15
31 2020-05-01 10:00:00 20.0 15
32 2020-05-01 11:00:00 5.0 106
33 2020-06-01 10:00:00 33.0 107
34 2020-06-01 11:00:00 20.0 20"""), sep="\s\s+", engine="python")
df.fecha = pd.to_datetime(df.fecha)
# The ruleset for 6 months is jan 45, feb 30, mar 45, abr 10, may 15, jun 30
limits = {1:45, 2:30, 3:45, 4:10, 5:15, 6:30}
df = df.assign(**{f"Prueba {c}":np.select( # construct target column name
# build a condition for each of the month limits
[df.fecha.dt.month.eq(m) & df[c].gt(l) for m,l in limits.items()],
[np.nan for m in limits.keys()], # NaN if beyond limit
df[c]) # keep value if within limits
for c in df.columns if "Avg" in c}) # do calc for all columns that have "Avg" in name
fecha
TEMP_C_Avg
RH_Avg
Prueba TEMP_C_Avg
Prueba RH_Avg
25
2020-02-01 10:00:00
29
5
29
5
26
2020-02-01 11:00:00
32
105
nan
nan
27
2020-03-01 10:00:00
55
3
nan
3
28
2020-03-01 11:00:00
40
5
40
5
29
2020-04-01 10:00:00
10
20
10
nan
30
2020-04-01 11:00:00
5
15
5
nan
31
2020-05-01 10:00:00
20
15
nan
15
32
2020-05-01 11:00:00
5
106
5
nan
33
2020-06-01 10:00:00
33
107
nan
nan
34
2020-06-01 11:00:00
20
20
20
20
Related
I am a new python user and have a few questions regarding filling NA's of a data frame.
Currently, I have a data frame that has a series of dates from 2022-08-01 to 2037-08-01 with a frequency of monthly data.
However, after 2027-06-01 the pricing data stops and I would like to extrapolate the values forward to fill out the rest of the dates. Essentially I would like to take the last 12 months of prices and fill those forward for the rest of the data frame. I am thinking of doing some type of groupby month with a fillna(method=ffill) however when I do this it just fills the last value in the df forward. Below is an example of my code.
Above is a picture you will see that the values stop at 12/1/2023 I wish to fill the previous 12 values forward for the rest of the maturity dates. So all prices fro 1/1/2023 to 12/1/2023 will be fill forward for all months.
import pandas as pd
mat = pd.DataFrame(pd.date_range('01/01/2020','01/01/2022',freq='MS'))
prices = pd.DataFrame(['179.06','174.6','182.3','205.59','204.78','202.19','216.17','218.69','220.73','223.28','225.16','226.31'])
example = pd.concat([mat,prices],axis=1)
example.columns = ['maturity', 'price']
Output
0 2020-01-01 179.06
1 2020-02-01 174.6
2 2020-03-01 182.3
3 2020-04-01 205.59
4 2020-05-01 204.78
5 2020-06-01 202.19
6 2020-07-01 216.17
7 2020-08-01 218.69
8 2020-09-01 220.73
9 2020-10-01 223.28
10 2020-11-01 225.16
11 2020-12-01 226.31
12 2021-01-01 NaN
13 2021-02-01 NaN
14 2021-03-01 NaN
15 2021-04-01 NaN
16 2021-05-01 NaN
17 2021-06-01 NaN
18 2021-07-01 NaN
19 2021-08-01 NaN
20 2021-09-01 NaN
21 2021-10-01 NaN
22 2021-11-01 NaN
23 2021-12-01 NaN
24 2022-01-01 NaN
Is this what you're looking for?
out = df.groupby(df.maturity.dt.month).ffill()
print(out)
Output:
maturity price
0 2020-01-01 179.06
1 2020-02-01 174.6
2 2020-03-01 182.3
3 2020-04-01 205.59
4 2020-05-01 204.78
5 2020-06-01 202.19
6 2020-07-01 216.17
7 2020-08-01 218.69
8 2020-09-01 220.73
9 2020-10-01 223.28
10 2020-11-01 225.16
11 2020-12-01 226.31
12 2021-01-01 179.06
13 2021-02-01 174.6
14 2021-03-01 182.3
15 2021-04-01 205.59
16 2021-05-01 204.78
17 2021-06-01 202.19
18 2021-07-01 216.17
19 2021-08-01 218.69
20 2021-09-01 220.73
21 2021-10-01 223.28
22 2021-11-01 225.16
23 2021-12-01 226.31
24 2022-01-01 179.06
If I had some random data created on a one hour sample..
import pandas as pd
import numpy as np
from numpy.random import randint
np.random.seed(10) # added for reproductibility
rng = pd.date_range('10/9/2018 00:00', periods=1000, freq='1H')
df = pd.DataFrame({'Random_Number':randint(1, 100, 1000)}, index=rng)
I can use the groupby to break out each day:
for idx, day in df.groupby(df.index.date):
print(day)
Now is there a way to calculate the time difference between daily min & max value based on the timestamp in hours? for each day record the daily min & max & time difference?
After some discussion (thanks #Erfan):
(df.Random_Number
.groupby(df.index.date)
.agg(['idxmin','idxmax'])
.diff(axis=1).iloc[:,1]
.div(pd.to_timedelta('1H'))
)
Output:
2018-10-09 -4.0
2018-10-10 -1.0
2018-10-11 -4.0
2018-10-12 12.0
2018-10-13 21.0
2018-10-14 6.0
2018-10-15 -6.0
2018-10-16 -18.0
2018-10-17 -8.0
2018-10-18 9.0
2018-10-19 -10.0
2018-10-20 3.0
2018-10-21 10.0
2018-10-22 2.0
2018-10-23 9.0
2018-10-24 2.0
2018-10-25 3.0
2018-10-26 2.0
2018-10-27 -22.0
2018-10-28 6.0
2018-10-29 -8.0
2018-10-30 -1.0
2018-10-31 -11.0
2018-11-01 19.0
2018-11-02 7.0
2018-11-03 4.0
2018-11-04 18.0
2018-11-05 -1.0
2018-11-06 15.0
2018-11-07 -14.0
2018-11-08 -16.0
2018-11-09 -2.0
2018-11-10 -7.0
2018-11-11 -14.0
2018-11-12 12.0
2018-11-13 -14.0
2018-11-14 2.0
2018-11-15 2.0
2018-11-16 6.0
2018-11-17 -7.0
2018-11-18 5.0
2018-11-19 9.0
Name: idxmax, dtype: float64
Alternatively, should you want to retain all columns with data frame output, consider merging on an aggregated dataset:
# ADJUST FOR DATETIME AND DATE AS COLUMNS
df = (df.reset_index()
.assign(date = df.index.date)
)
# AGGREGATION + MERGE ON MIN/MAX + CALCULATION
agg_df = (df.groupby('date')['Random_Number']
.agg(["min", "max"])
.reset_index()
.merge(df, left_on=['date', 'max'], right_on=['date', 'Random_Number'])
.merge(df, left_on=['date', 'min'], right_on=['date', 'Random_Number'],
suffixes=['', '_min'])
.assign(diff = lambda x: (x['index'] - x['index_min']) / pd.to_timedelta('1H'))
)
print(agg_df.head(10))
# date min max index Random_Number index_min Random_Number_min diff
# 0 2018-10-09 1 94 2018-10-09 05:00:00 94 2018-10-09 09:00:00 1 -4.0
# 1 2018-10-10 12 95 2018-10-10 20:00:00 95 2018-10-10 21:00:00 12 -1.0
# 2 2018-10-11 5 97 2018-10-11 15:00:00 97 2018-10-11 19:00:00 5 -4.0
# 3 2018-10-12 7 98 2018-10-12 18:00:00 98 2018-10-12 06:00:00 7 12.0
# 4 2018-10-13 1 91 2018-10-13 22:00:00 91 2018-10-13 01:00:00 1 21.0
# 5 2018-10-14 1 97 2018-10-14 10:00:00 97 2018-10-14 04:00:00 1 6.0
# 6 2018-10-15 9 97 2018-10-15 06:00:00 97 2018-10-15 12:00:00 9 -6.0
# 7 2018-10-16 3 95 2018-10-16 04:00:00 95 2018-10-16 22:00:00 3 -18.0
# 8 2018-10-17 2 95 2018-10-17 13:00:00 95 2018-10-17 21:00:00 2 -8.0
# 9 2018-10-18 1 91 2018-10-18 21:00:00 91 2018-10-18 12:00:00 1 9.0
I have a dataframe like this:
Start date end date A B
01.01.2020 30.06.2020 2 3
01.01.2020 31.12.2020 3 1
01.04.2020 30.04.2020 6 2
01.01.2021 31.12.2021 2 3
01.07.2020 31.12.2020 8 2
01.01.2020 31.12.2023 1 2
.......
I would like to split the rows where end - start > 1 year (see last row where end=2023 and start = 2020), keeping the same value for column A, while splitting proportionally the value in column B:
Start date end date A B
01.01.2020 30.06.2020 2 3
01.01.2020 31.12.2020 3 1
01.04.2020 30.04.2020 6 2
01.01.2021 31.12.2021 2 3
01.07.2020 31.12.2020 8 2
01.01.2020 31.12.2020 1 2/4
01.01.2021 31.12.2021 1 2/4
01.01.2022 31.12.2022 1 2/4
01.01.2023 31.12.2023 1 2/4
.......
Any idea?
Here is my solution. See the comments below:
import io
# TEST DATA:
text=""" start end A B
01.01.2020 30.06.2020 2 3
01.01.2020 31.12.2020 3 1
01.04.2020 30.04.2020 6 2
01.01.2021 31.12.2021 2 3
01.07.2020 31.12.2020 8 2
31.12.2020 20.01.2021 12 12
31.12.2020 01.01.2021 22 22
30.12.2020 01.01.2021 32 32
10.05.2020 28.09.2023 44 44
27.11.2020 31.12.2023 88 88
31.12.2020 31.12.2023 100 100
01.01.2020 31.12.2021 200 200
"""
df= pd.read_csv(io.StringIO(text), sep=r"\s+", engine="python", parse_dates=[0,1])
#print("\n----\n df:",df)
#----------------------------------------
# SOLUTION:
def split_years(r):
"""
Split row 'r' where "end"-"start" greater than 0.
The new rows have repeated values of 'A', and 'B' divided by the number of years.
Return: a DataFrame with rows per year.
"""
t1,t2 = r["start"], r["end"]
ys= t2.year - t1.year
kk= 0 if t1.is_year_end else 1
if ys>0:
l1=[t1] + [ t1+pd.offsets.YearBegin(i) for i in range(1,ys+1) ]
l2=[ t1+pd.offsets.YearEnd(i) for i in range(kk,ys+kk) ] + [t2]
return pd.DataFrame({"start":l1, "end":l2, "A":r.A,"B": r.B/len(l1)})
print("year difference <= 0!")
return None
# Create two groups, one for rows where the 'start' and 'end' is in the same year, and one for the others:
grps= df.groupby(lambda idx: (df.loc[idx,"start"].year-df.loc[idx,"end"].year)!=0 ).groups
print("\n---- grps:\n",grps)
# Extract the "one year" rows in a data frame:
df1= df.loc[grps[False]]
#print("\n---- df1:\n",df1)
# Extract the rows to be splitted:
df2= df.loc[grps[True]]
print("\n---- df2:\n",df2)
# Split the rows and put the resulting data frames into a list:
ldfs=[ split_years(df2.loc[row]) for row in df2.index ]
print("\n---- ldfs:")
for fr in ldfs:
print(fr,"\n")
# Insert the "one year" data frame to the list, and concatenate them:
ldfs.insert(0,df1)
df_rslt= pd.concat(ldfs,sort=False)
#print("\n---- df_rslt:\n",df_rslt)
# Housekeeping:
df_rslt= df_rslt.sort_values("start").reset_index(drop=True)
print("\n---- df_rslt:\n",df_rslt)
Outputs:
---- grps:
{False: Int64Index([0, 1, 2, 3, 4], dtype='int64'), True: Int64Index([5, 6, 7, 8, 9, 10, 11], dtype='int64')}
---- df2:
start end A B
5 2020-12-31 2021-01-20 12 12
6 2020-12-31 2021-01-01 22 22
7 2020-12-30 2021-01-01 32 32
8 2020-10-05 2023-09-28 44 44
9 2020-11-27 2023-12-31 88 88
10 2020-12-31 2023-12-31 100 100
11 2020-01-01 2021-12-31 200 200
---- ldfs:
start end A B
0 2020-12-31 2020-12-31 12 6.0
1 2021-01-01 2021-01-20 12 6.0
start end A B
0 2020-12-31 2020-12-31 22 11.0
1 2021-01-01 2021-01-01 22 11.0
start end A B
0 2020-12-30 2020-12-31 32 16.0
1 2021-01-01 2021-01-01 32 16.0
start end A B
0 2020-10-05 2020-12-31 44 11.0
1 2021-01-01 2021-12-31 44 11.0
2 2022-01-01 2022-12-31 44 11.0
3 2023-01-01 2023-09-28 44 11.0
start end A B
0 2020-11-27 2020-12-31 88 22.0
1 2021-01-01 2021-12-31 88 22.0
2 2022-01-01 2022-12-31 88 22.0
3 2023-01-01 2023-12-31 88 22.0
start end A B
0 2020-12-31 2020-12-31 100 25.0
1 2021-01-01 2021-12-31 100 25.0
2 2022-01-01 2022-12-31 100 25.0
3 2023-01-01 2023-12-31 100 25.0
start end A B
0 2020-01-01 2020-12-31 200 100.0
1 2021-01-01 2021-12-31 200 100.0
---- df_rslt:
start end A B
0 2020-01-01 2020-06-30 2 3.0
1 2020-01-01 2020-12-31 3 1.0
2 2020-01-01 2020-12-31 200 100.0
3 2020-01-04 2020-04-30 6 2.0
4 2020-01-07 2020-12-31 8 2.0
5 2020-10-05 2020-12-31 44 11.0
6 2020-11-27 2020-12-31 88 22.0
7 2020-12-30 2020-12-31 32 16.0
8 2020-12-31 2020-12-31 12 6.0
9 2020-12-31 2020-12-31 100 25.0
10 2020-12-31 2020-12-31 22 11.0
11 2021-01-01 2021-12-31 100 25.0
12 2021-01-01 2021-12-31 88 22.0
13 2021-01-01 2021-12-31 44 11.0
14 2021-01-01 2021-01-01 32 16.0
15 2021-01-01 2021-01-01 22 11.0
16 2021-01-01 2021-01-20 12 6.0
17 2021-01-01 2021-12-31 2 3.0
18 2021-01-01 2021-12-31 200 100.0
19 2022-01-01 2022-12-31 88 22.0
20 2022-01-01 2022-12-31 100 25.0
21 2022-01-01 2022-12-31 44 11.0
22 2023-01-01 2023-09-28 44 11.0
23 2023-01-01 2023-12-31 88 22.0
24 2023-01-01 2023-12-31 100 25.0
Bit of a different approach, adding new columns instead of new rows. But I think this accomplishes what you want to do.
df["years_apart"] = (
(df["end_date"] - df["start_date"]).dt.days / 365
).astype(int)
for years in range(1, df["years_apart"].max().astype(int)):
df[f"{years}_end_date"] = pd.NaT
df.loc[
df["years_apart"] == years, f"{years}_end_date"
] = df.loc[
df["years_apart"] == years, "start_date"
] + dt.timedelta(days=365*years)
df["B_bis"] = df["B"] / df["years_apart"]
Output
start_date end_date years_apart 1_end_date 2_end_date ...
2018-01-01 2018-01-02 0 NaT NaT
2018-01-02 2019-01-02 1 2019-01-02 NaT
2018-01-03 2020-01-03 2 NaT 2020-01-03
I have solved it creating a date difference and a counter that adds years to the repeated rows:
#calculate difference between start and end year
table['diff'] = (table['end'] - table['start'])//timedelta(days=365)
table['diff'] = table['diff']+1
#replicate rows depending on number of years
table = table.reindex(table.index.repeat(table['diff']))
#counter that increase for diff>1, assign increasing years to the replicated rows
table['count'] = table['diff'].groupby(table['diff']).cumsum()//table['diff']
table['start'] = np.where(table['diff']>1, table['start']+table['count']-1, table['start'])
table['end'] = table['start']
#split B among years
table['B'] = table['B']//table['diff']
I am trying to calculate the total cost of staffing requirements over a day. My attempt is to group People required throughout the day and multiply the cost. I then try to group this cost per/hour. But my output isn't correct.
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as dates
d = ({
'Time' : ['0/1/1900 8:00:00','0/1/1900 9:59:00','0/1/1900 10:00:00','0/1/1900 12:29:00','0/1/1900 12:30:00','0/1/1900 13:00:00','0/1/1900 13:02:00','0/1/1900 13:15:00','0/1/1900 13:20:00','0/1/1900 18:10:00','0/1/1900 18:15:00','0/1/1900 18:20:00','0/1/1900 18:25:00','0/1/1900 18:45:00','0/1/1900 18:50:00','0/1/1900 19:05:00','0/1/1900 19:07:00','0/1/1900 21:57:00','0/1/1900 22:00:00','0/1/1900 22:30:00','0/1/1900 22:35:00','1/1/1900 3:00:00','1/1/1900 3:05:00','1/1/1900 3:20:00','1/1/1900 3:25:00'],
'People' : [1,1,2,2,3,3,2,2,3,3,4,4,3,3,2,2,3,3,4,4,3,3,2,2,1],
})
df = pd.DataFrame(data = d)
df['Time'] = ['/'.join([str(int(x.split('/')[0])+1)] + x.split('/')[1:]) for x in df['Time']]
df['Time'] = pd.to_datetime(df['Time'], format='%d/%m/%Y %H:%M:%S')
formatter = dates.DateFormatter('%Y-%m-%d %H:%M:%S')
df = df.groupby(pd.Grouper(freq='15T',key='Time'))['People'].max().ffill()
df = df.reset_index(level=['Time'])
df['Cost'] = df['People'] * 26
cost = df.groupby([df['Time'].dt.hour])['Cost'].sum()
#For reference. This plot displays people required throughout the day
fig, ax = plt.subplots(figsize = (10,5))
plt.plot(df['Time'], df['People'], color = 'blue')
plt.locator_params(axis='y', nbins=6)
ax.xaxis.set_major_formatter(formatter)
ax.xaxis.set_major_formatter(dates.DateFormatter('%H:%M:%S'))
plt.ylabel('People Required', labelpad = 10)
plt.xlabel('Time', labelpad = 10)
print(cost)
Out:
0 416.0
1 416.0
2 416.0
3 130.0
8 104.0
9 104.0
10 208.0
11 208.0
12 260.0
13 312.0
14 312.0
15 312.0
16 312.0
17 312.0
18 364.0
19 312.0
20 312.0
21 312.0
22 416.0
23 416.0
I have done the calculations manually an the total cost output should be:
$1456
I think the wrong numbers in your question is most likely caused by the incorrect datetime values that you have. Once you have fixed that, you should get the correct numbers. Here's an attempt from my end, with a little tweak to the Time column.
import pandas as pd
df = pd.DataFrame({
'Time' : ['1/1/1900 8:00:00','1/1/1900 9:59:00','1/1/1900 10:00:00','1/1/1900 12:29:00','1/1/1900 12:30:00','1/1/1900 13:00:00','1/1/1900 13:02:00','1/1/1900 13:15:00','1/1/1900 13:20:00','1/1/1900 18:10:00','1/1/1900 18:15:00','1/1/1900 18:20:00','1/1/1900 18:25:00','1/1/1900 18:45:00','1/1/1900 18:50:00','1/1/1900 19:05:00','1/1/1900 19:07:00','1/1/1900 21:57:00','1/1/1900 22:00:00','1/1/1900 22:30:00','1/1/1900 22:35:00','1/2/1900 3:00:00','1/2/1900 3:05:00','1/2/1900 3:20:00','1/2/1900 3:25:00'],
'People' : [1,1,2,2,3,3,2,2,3,3,4,4,3,3,2,2,3,3,4,4,3,3,2,2,1],
})
>>>df
Time People
0 1/1/1900 8:00:00 1
1 1/1/1900 9:59:00 1
2 1/1/1900 10:00:00 2
3 1/1/1900 12:29:00 2
4 1/1/1900 12:30:00 3
5 1/1/1900 13:00:00 3
6 1/1/1900 13:02:00 2
7 1/1/1900 13:15:00 2
8 1/1/1900 13:20:00 3
9 1/1/1900 18:10:00 3
10 1/1/1900 18:15:00 4
11 1/1/1900 18:20:00 4
12 1/1/1900 18:25:00 3
13 1/1/1900 18:45:00 3
14 1/1/1900 18:50:00 2
15 1/1/1900 19:05:00 2
16 1/1/1900 19:07:00 3
17 1/1/1900 21:57:00 3
18 1/1/1900 22:00:00 4
19 1/1/1900 22:30:00 4
20 1/1/1900 22:35:00 3
21 1/2/1900 3:00:00 3
22 1/2/1900 3:05:00 2
23 1/2/1900 3:20:00 2
24 1/2/1900 3:25:00 1
df.Time = pd.to_datetime(df.Time)
df.Time.set_index('Time', inplace=True)
df_group = df.resample('15T').max().ffill()
df_hour = df_group.resample('1h').max()
df_hour['Cost'] = df_hour['People'] * 26
>>>df_hour
People Cost
Time
1900-01-01 08:00:00 1.0 26.0
1900-01-01 09:00:00 1.0 26.0
1900-01-01 10:00:00 2.0 52.0
1900-01-01 11:00:00 2.0 52.0
1900-01-01 12:00:00 3.0 78.0
1900-01-01 13:00:00 3.0 78.0
1900-01-01 14:00:00 3.0 78.0
1900-01-01 15:00:00 3.0 78.0
1900-01-01 16:00:00 3.0 78.0
1900-01-01 17:00:00 3.0 78.0
1900-01-01 18:00:00 4.0 104.0
1900-01-01 19:00:00 3.0 78.0
1900-01-01 20:00:00 3.0 78.0
1900-01-01 21:00:00 3.0 78.0
1900-01-01 22:00:00 4.0 104.0
1900-01-01 23:00:00 4.0 104.0
1900-01-02 00:00:00 4.0 104.0
1900-01-02 01:00:00 4.0 104.0
1900-01-02 02:00:00 4.0 104.0
1900-01-02 03:00:00 3.0 78.0
>>>df_hour.sum()
People 60.0
Cost 1560.0
dtype: float64
Edit: Took me reading the second time to realize the methodology that you're using. The incorrect number that you got is likely due to grouping by sum() after you performed a ffill() on your aggregated People column. Since ffill() fills the holes from the last valid value, you actually overestimated your cost for these periods. You should be using max() again, to find the maximum number of headcount required for that hour.
I have a dataframe which has 4 columns: day, time, tmin and tmax. tmin shows the temperature_min of the day and tmax shows the temperature_max.
What I want is to be able to fill all of the NaN values of one day with tmin and tmax of that day. For example I want to convert this dataframe:
day time tmin tmax
0 01 00:00:00 NaN NaN
1 01 03:00:00 -6.8 NaN
2 01 06:00:00 NaN NaN
3 01 09:00:00 NaN NaN
4 01 12:00:00 NaN NaN
5 01 15:00:00 NaN 1.2
6 01 18:00:00 NaN NaN
7 01 21:00:00 NaN NaN
8 02 00:00:00 NaN NaN
9 02 03:00:00 -7.2 NaN
10 02 06:00:00 NaN NaN
11 02 09:00:00 NaN NaN
12 02 12:00:00 NaN NaN
13 02 15:00:00 NaN 1.8
14 02 18:00:00 NaN NaN
15 02 21:00:00 NaN NaN
to this dataframe:
day time tmin tmax
0 01 00:00:00 -6.8 1.2
1 01 03:00:00 -6.8 1.2
2 01 06:00:00 -6.8 1.2
3 01 09:00:00 -6.8 1.2
4 01 12:00:00 -6.8 1.2
5 01 15:00:00 -6.8 1.2
6 01 18:00:00 -6.8 1.2
7 01 21:00:00 -6.8 1.2
8 02 00:00:00 -7.2 1.8
9 02 03:00:00 -7.2 1.8
10 02 06:00:00 -7.2 1.8
11 02 09:00:00 -7.2 1.8
12 02 12:00:00 -7.2 1.8
13 02 15:00:00 -7.2 1.8
14 02 18:00:00 -7.2 1.8
15 02 21:00:00 -7.2 1.8
Using groupby and transform:
df.assign(**df.groupby('day')[['tmin', 'tmax']].transform('first'))
day time tmin tmax
0 1 00:00:00 -6.8 1.2
1 1 03:00:00 -6.8 1.2
2 1 06:00:00 -6.8 1.2
3 1 09:00:00 -6.8 1.2
4 1 12:00:00 -6.8 1.2
5 1 15:00:00 -6.8 1.2
6 1 18:00:00 -6.8 1.2
7 1 21:00:00 -6.8 1.2
8 2 00:00:00 -7.2 1.8
9 2 03:00:00 -7.2 1.8
10 2 06:00:00 -7.2 1.8
11 2 09:00:00 -7.2 1.8
12 2 12:00:00 -7.2 1.8
13 2 15:00:00 -7.2 1.8
14 2 18:00:00 -7.2 1.8
15 2 21:00:00 -7.2 1.8
Or, if you'd like to modify the original DataFrame instead of returning a copy:
df[['tmin', 'tmax']] = df.groupby('day')[['tmin', 'tmax']].transform('first')
If you want to do it not as neatly as #user3483203 has done!
import pandas as pd
myfile = pd.read_csv('temperature.txt', sep=' ')
mydata = pd.DataFrame(data = myfile)
for i in mydata['day']:
row_start = (i-1) * 8 # assuming 8 data points per day
row_end = (i) * 8
mydata['tmin'][row_start:row_end] = pd.DataFrame.min(tempdata['tmin'][row_start:row_end], skipna=True)
mydata['tmax'][row_start:row_end] = pd.DataFrame.max(tempdata['tmax'][row_start:row_end], skipna=True)
Since you did not post any code, here's a general solution:
Step 1: Create variables that will keep track of the min and max temps
Step 2: Loop through each row in the frame
Step 3: For each row, check if the min or max == "NaN"
Step 4: If it is, replace with the value of the min or max variable we created earlier
just use the fillna with the forward fill and back fill parameters:
df.tmin = df.groupby('day')['tmin'].fillna(method='ffill').fillna(method='bfill')
df.tmax = df.groupby('day')['tmax'].fillna(method='ffill').fillna(method='bfill')