apply lambda to df based on condition - python

Say I make up a df with some random data:
import numpy as np
import pandas as pd
np.random.seed(11)
rows,cols = 24,3
data = np.random.rand(rows,cols)
tidx = pd.date_range('2019-01-01', periods=rows, freq='H')
df = pd.DataFrame(data, columns=['Temperature1','Temperature2','Value'], index=tidx)
How could I use a lambda function to add 5000 to the columns Temperature1 & Temperature2, but only in rows where the df index hour is less than 6?
If I use
for hour in df.index.hour:
    if hour < 6:  # and name contains 'Temperature'
        df = df.apply(lambda x: x + 5000)
The above code isn't correct; it adds 5000 to all rows in the df. Any tips greatly appreciated.

You can do this with loc:
# get the temperature columns
temp_cols = [x for x in df.columns if 'Temperature' in x]
# update with loc access
df.loc[df.index.hour<6, temp_cols] += 5000
Output:
Temperature1 Temperature2 Value
2019-01-01 00:00:00 5000.180270 5000.019475 0.463219
2019-01-01 01:00:00 5000.724934 5000.420204 0.485427
2019-01-01 02:00:00 5000.012781 5000.487372 0.941807
2019-01-01 03:00:00 5000.850795 5000.729964 0.108736
2019-01-01 04:00:00 5000.893904 5000.857154 0.165087
2019-01-01 05:00:00 5000.632334 5000.020484 0.116737
2019-01-01 06:00:00 0.316367 0.157912 0.758980
2019-01-01 07:00:00 0.818275 0.344624 0.318799
2019-01-01 08:00:00 0.111661 0.083953 0.712726
2019-01-01 09:00:00 0.599543 0.055674 0.479797
2019-01-01 10:00:00 0.401676 0.847979 0.717849
2019-01-01 11:00:00 0.602064 0.552384 0.949102
2019-01-01 12:00:00 0.986673 0.338054 0.239875
2019-01-01 13:00:00 0.796436 0.063686 0.364616
2019-01-01 14:00:00 0.070023 0.319368 0.070383
2019-01-01 15:00:00 0.290264 0.790101 0.905400
2019-01-01 16:00:00 0.792621 0.561819 0.616018
2019-01-01 17:00:00 0.361484 0.168817 0.436241
2019-01-01 18:00:00 0.732825 0.062888 0.020733
2019-01-01 19:00:00 0.770548 0.299952 0.701164
2019-01-01 20:00:00 0.734668 0.932905 0.400328
2019-01-01 21:00:00 0.358438 0.806567 0.764491
2019-01-01 22:00:00 0.652615 0.810967 0.642215
2019-01-01 23:00:00 0.957444 0.333874 0.738253
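Since the question asked about lambda specifically, here is a minimal sketch of an apply-based version, reusing the temp_cols list from above (the .loc approach remains the cleaner option):
import numpy as np
# build the row mask once, then add 5000 where it holds, column by column
mask = df.index.hour < 6
df[temp_cols] = df[temp_cols].apply(lambda col: np.where(mask, col + 5000, col))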

Boolean-select the columns whose names contain 'Temperature':
m = df.columns.str.contains('Temperature')
Then select the rows with hour < 6 and update:
df.loc[df.index.hour < 6, m] += 5000

Related

Change value in pandas series based on hour of the day using df.apply and if statement

I have a large df with a datetime index with an hourly time step and precipitation values in several columns. My precipitation values are a cumulative total during the day (from 1:00 am to 0:00 am of the next day) and are reset every day, for example:
datetime S1
2000-01-01 00:00:00 4.5 ...
2000-01-01 01:00:00 0 ...
2000-01-01 02:00:00 0 ...
2000-01-01 03:00:00 0 ...
2000-01-01 04:00:00 0
2000-01-01 05:00:00 0
2000-01-01 06:00:00 0
2000-01-01 07:00:00 0
2000-01-01 08:00:00 0
2000-01-01 09:00:00 0
2000-01-01 10:00:00 0
2000-01-01 11:00:00 6.5
2000-01-01 12:00:00 7.5
2000-01-01 13:00:00 8.7
2000-01-01 14:00:00 8.7
...
2000-01-01 22:00:00 8.7
2000-01-01 23:00:00 8.7
2000-01-02 00:00:00 8.7
2000-01-02 01:00:00 0
I am trying to go from this to the actual hourly values, so the value for 1:00 am of every day is fine, and then I want to subtract the value from the timestep before.
Can I somehow use an if statement inside df.apply?
I thought of something like:
df_copy = df.copy()
df = df.apply(lambda x: if df.hour !=1: era5_T[x]=era5_T[x]-era5_T_copy[x-1])
But this is not working since I'm not calling a function? I could work with a for loop but that doesn't seem like the most efficient way as I'm working with a big dataset.
You can use numpy.where and pd.Series.shift to achieve the result:
import numpy as np
# assuming the datetime is the index, as in the sample above
df['hourly_S1'] = np.where(df.index.hour == 1, df['S1'], df['S1'] - df['S1'].shift())
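As a quick check, here is a self-contained sketch on a toy frame (the index and the cumulative values are made up, mirroring the question's sample):
import numpy as np
import pandas as pd
# toy cumulative series with an hourly datetime index (made-up values)
idx = pd.date_range('2000-01-01 00:00', periods=5, freq='H')
df = pd.DataFrame({'S1': [4.5, 0.0, 2.0, 3.5, 7.5]}, index=idx)
df['hourly_S1'] = np.where(df.index.hour == 1, df['S1'], df['S1'] - df['S1'].shift())
print(df)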

How to convert hourly data to half hourly

I have the following dataframe:
datetime temp
0 2015-01-01 00:00:00 11.22
1 2015-01-01 01:00:00 11.32
2 2015-01-01 02:00:00 11.30
3 2015-01-01 03:00:00 11.25
4 2015-01-01 04:00:00 11.32
... ... ...
31339 2018-07-29 19:00:00 17.60
31340 2018-07-29 20:00:00 17.49
31341 2018-07-29 21:00:00 17.44
31342 2018-07-29 22:00:00 17.39
31343 2018-07-29 23:00:00 17.37
I want to convert this dataframe to have data every half hour, imputing each new position with the mean of the previous and following values (or any similar interpolation), that is, for example:
datetime temp
0 2015-01-01 00:00:00 11.00
1 2015-01-01 00:30:00 11.50
2 2015-01-01 01:00:00 12.00
Is there any pandas/datetime function to assist in this operation?
Thank you
You can use the resample() function in pandas. With it you set the frequency to down/upsample to and then what to do with the result (mean, sum, etc.). In your case you can also interpolate between the values.
For this to work, your datetime column has to have a datetime dtype; then set it as the index.
df['datetime'] = pd.to_datetime(df['datetime'])
df.set_index('datetime', inplace=True)
Then you can resample to 30 minutes ('30T') and then interpolate.
df.resample('30T').interpolate()
Resulting in...
temp
datetime
2015-01-01 00:00:00 11.220
2015-01-01 00:30:00 11.270
2015-01-01 01:00:00 11.320
2015-01-01 01:30:00 11.310
2015-01-01 02:00:00 11.300
2015-01-01 02:30:00 11.275
2015-01-01 03:00:00 11.250
2015-01-01 03:30:00 11.285
2015-01-01 04:00:00 11.320
Read more about the frequency strings and resampling in the Pandas docs.
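The same call handles downsampling; for example, a sketch aggregating the hourly data to two-hourly means instead:
df.resample('2H').mean()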

datetime difference between dates

I have a df like so:
firstdate seconddate
0 2011-01-01 13:00:00 2011-01-01 13:00:00
1 2011-01-02 14:00:00 2011-01-01 11:00:00
2 2011-01-02 16:00:00 2011-01-02 13:00:00
3 2011-01-04 12:00:00 2011-01-03 15:00:00
...
seconddate is always before firstdate. I want to compute the difference between firstdate and seconddate in days and make this a column: if firstdate and seconddate are the same day, difference = 0; if seconddate is the day before firstdate, difference = 1; and so on, up to a week. How would I do this?
df['firstdate'] = pd.to_datetime(df['firstdate'])
df['seconddate'] = pd.to_datetime(df['seconddate'])
df['diff'] = (df['firstdate'] - df['seconddate']).dt.days
This will add a column with the difference in days. You can then drop rows based on it:
# note the bracket access: df.diff is a DataFrame method
df.drop(df[df['diff'] < 0].index)
# or
df = df[df['diff'] > 0]
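A self-contained sketch putting this together, with the "up to a week" filter from the question (the sample rows are abbreviated):
import pandas as pd
df = pd.DataFrame({
    'firstdate':  ['2011-01-01 13:00:00', '2011-01-02 14:00:00'],
    'seconddate': ['2011-01-01 13:00:00', '2011-01-01 11:00:00'],
})
df['firstdate'] = pd.to_datetime(df['firstdate'])
df['seconddate'] = pd.to_datetime(df['seconddate'])
df['diff'] = (df['firstdate'] - df['seconddate']).dt.days
df = df[df['diff'].between(0, 7)]  # same day = 0, up to a week = 7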

Include one row in multiple groupby groups

I am grouping a time series by hour to perform an operation on each hour of data separately:
import pandas as pd
from datetime import datetime, timedelta
x = [2, 2, 4, 2, 2, 0]
idx = pd.date_range(
    start=datetime(2019, 1, 1),
    end=datetime(2019, 1, 1, 2, 30),
    freq=timedelta(minutes=30),
)
s = pd.Series(x, index=idx)
hourly = s.groupby(lambda x: x.hour)
print(s)
print("summed:")
print(hourly.sum())
which produces:
2019-01-01 00:00:00 2
2019-01-01 00:30:00 2
2019-01-01 01:00:00 4
2019-01-01 01:30:00 2
2019-01-01 02:00:00 2
2019-01-01 02:30:00 0
Freq: 30T, dtype: int64
summed:
0 4
1 6
2 2
dtype: int64
As expected.
I now want to know the area under the time series per hour, for which I can use numpy.trapz:
import numpy as np
def series_trapz(series):
    hours = [i.timestamp() / 3600 for i in series.index]
    return np.trapz(series, x=hours)
print("Area under curve")
print(hourly.agg(series_trapz))
But for this to work correctly, the boundaries between the groups must appear in both groups!
For example, the first group must be:
2019-01-01 00:00:00 2
2019-01-01 00:30:00 2
2019-01-01 01:00:00 4
and the second group must be
2019-01-01 01:00:00 4
2019-01-01 01:30:00 2
2019-01-01 02:00:00 2
etc.
Is this at all possible using pandas.groupby?
I don't think that I have your np.trapz logic completely correct here, but I think you can probably get what you want with .rolling(..., closed="both") so that the endpoints of the intervals are always included:
In [366]: s.rolling("1H", closed="both").apply(np.trapz).iloc[::2]
Out[366]:
2019-01-01 00:00:00 0.0
2019-01-01 01:00:00 5.0
2019-01-01 02:00:00 5.0
Freq: 60T, dtype: float64
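Note that np.trapz here integrates with its default dx=1, i.e. one unit per sample rather than per hour. If you want the area in the same hour units as series_trapz above, a sketch that passes x explicitly (raw=False keeps the window's datetime index):
area = s.rolling("1H", closed="both").apply(
    lambda w: np.trapz(w, x=w.index.astype("int64") / 3.6e12),  # ns -> hours
    raw=False,
).iloc[::2]
With the sample data this gives 2.5 rather than 5.0 for the second window.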
I think you could repeat the rows at the group boundaries of your series using Series.repeat:
r=(s.index.minute==0).astype(int)+1
new_s=s.repeat(r)
print(new_s)
2019-01-01 00:00:00 2
2019-01-01 00:30:00 2
2019-01-01 01:00:00 4
2019-01-01 01:00:00 4
2019-01-01 01:30:00 2
2019-01-01 02:00:00 2
2019-01-01 02:00:00 2
2019-01-01 02:30:00 0
Then you could use Series.groupby:
groups = (new_s.index.to_series().shift(-1, fill_value=0).dt.minute != 0).cumsum()
for i, group in new_s.groupby(groups):
    print(group)
    print('-' * 50)
2019-01-01 00:00:00 2
2019-01-01 00:30:00 2
2019-01-01 01:00:00 4
Name: col1, dtype: int64
--------------------------------------------------
2019-01-01 01:00:00 4
2019-01-01 01:30:00 2
2019-01-01 02:00:00 2
Name: col1, dtype: int64
--------------------------------------------------
2019-01-01 02:00:00 2
2019-01-01 02:30:00 0
Name: col1, dtype: int64
--------------------------------------------------
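From here you can feed the groups straight into the asker's series_trapz from above, e.g. (a sketch):
print(new_s.groupby(groups).agg(series_trapz))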
IIUC, this can be solved manually with rolling:
hours = np.unique(s.index.floor('H'))
# the answer:
(s.add(s.shift())
  .mul(s.index.to_series()
        .diff()
        .dt.total_seconds()
        .div(3600))
  .rolling('1H').sum()[hours]
)
Output:
2019-01-01 00:00:00 NaN
2019-01-01 01:00:00 5.0
2019-01-01 02:00:00 5.0
dtype: float64

Python: Grouping by time interval

I have a dataframe that looks like this:
I'm using python 3.6.5 and a datetime.time object for the index
print(sum_by_time)
Trips
Time
00:00:00 10
01:00:00 10
02:00:00 10
03:00:00 10
04:00:00 20
05:00:00 20
06:00:00 20
07:00:00 20
08:00:00 30
09:00:00 30
10:00:00 30
11:00:00 30
How can I group this dataframe by time interval to get something like this:
Trips
Time
00:00:00 - 03:00:00 40
04:00:00 - 07:00:00 80
08:00:00 - 11:00:00 120
I think you need to convert the index values to timedeltas with to_timedelta and then resample:
df.index = pd.to_timedelta(df.index.astype(str))
df = df.resample('4H').sum()
print (df)
Trips
00:00:00 40
04:00:00 80
08:00:00 120
EDIT:
For your exact output format you need:
df['d'] = pd.to_datetime(df.index.astype(str))
df = df.groupby(pd.Grouper(freq='4H', key='d')).agg({'Trips':'sum', 'd':['first','last']})
df.columns = df.columns.map('_'.join)
df = df.set_index(df['d_first'].dt.strftime('%H:%M:%S') + ' - ' + df['d_last'].dt.strftime('%H:%M:%S'))[['Trips_sum']]
print (df)
Trips_sum
00:00:00 - 03:00:00 40
04:00:00 - 07:00:00 80
08:00:00 - 11:00:00 120
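A hedged alternative for the labels: since the bins start every 4 hours and the data is hourly, each bin ends 3 hours after it starts, so the labels can be built from the resampled index directly (a sketch, starting again from the original time-of-day index and assuming that fixed spacing):
out = df.copy()
out.index = pd.to_datetime(out.index.astype(str))
out = out.resample('4H').sum()
out.index = (out.index.strftime('%H:%M:%S') + ' - '
             + (out.index + pd.Timedelta(hours=3)).strftime('%H:%M:%S'))
print(out)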
