If else condition inside df.loc pandas - python

I have the following dataframe:
                                   time       u10  ...         tsn        tp
longitude latitude
20.0      45.0      2014-01-01 00:00:00  2.595551  ...  272.453827  0.000014
          45.0      2014-01-01 01:00:00  2.615493  ...  273.159973  0.000000
          45.0      2014-01-01 02:00:00  2.587403  ...  273.122192  0.000000
          45.0      2014-01-01 03:00:00  2.528865  ...  273.050903  0.000000
          45.0      2014-01-01 04:00:00  2.556740  ...  272.772491  0.000000
I want to take the difference between neighboring records of column u10 for every value of column time, except where the time is midnight (ends with 00:00:00).
I need the following output:
                                   time       u10  ...         tsn        tp
longitude latitude
20.0      45.0      2014-01-01 00:00:00  2.595551  ...  272.453827  0.000014
          45.0      2014-01-01 01:00:00  0.019942  ...  273.159973  0.000000
          45.0      2014-01-01 02:00:00 -0.028090  ...  273.122192  0.000000
          45.0      2014-01-01 03:00:00 -0.058538  ...  273.050903  0.000000
I can combine df.loc with df['time'].shift() - df['time'], but that will apply to all records.
How can I make this work to get the desired output?
P.S. I am looking for a vectorized solution.

Use np.where: where midnight occurs, keep the u10 value; where it doesn't, take the consecutive difference.
import numpy as np
from datetime import time

df['u10'] = np.where(df['time'].dt.time == time(0, 0, 0), df['u10'], df['u10'].diff())
                                   time       u10
longitude latitude
20.0      45.0      2014-01-01 00:00:00  2.595551
          45.0      2014-01-01 01:00:00  0.019942
          45.0      2014-01-01 02:00:00 -0.028090
          45.0      2014-01-01 03:00:00 -0.058538
          45.0      2014-01-01 04:00:00  0.027875
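The same thing can be written without np.where by staying in the pandas API; a sketch using Series.mask, applied to the original, unmodified u10 column (mask keeps the difference except where the condition is True, where it falls back to the raw value):
from datetime import time

# consecutive differences everywhere, but the raw u10 value at midnight
df['u10'] = df['u10'].diff().mask(df['time'].dt.time == time(0, 0), df['u10'])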

Related

Expand a time series in the form of numpy.array(), pandas.DataFrame(), or xarray.DataSet() to contain the missing records as NaN

import numpy as np
import pandas as pd
import xarray as xr
validIdx = np.ones(365 * 5, dtype=bool)
validIdx[np.random.randint(low=0, high=365 * 5, size=30)] = False
time = pd.date_range("2000-01-01", freq="H", periods=365 * 5)[validIdx]
data = np.arange(365 * 5)[validIdx]
ds = xr.Dataset({"foo": ("time", data), "time": time})
df = ds.to_dataframe()
In the above example, the time series ds (or df) has 30 randomly chosen missing records, without those appearing as NaNs; therefore the length of the data is 365x5 - 30, not 365x5.
My question is this: how can I expand ds and df to have the 30 missing values as NaNs (so the length will be 365x5)? For example, if a value at "2000-12-02" were missing from the example data, it would look like:
...
2000-12-01 value 1
2000-12-03 value 2
...
, while what I want to have is:
...
2000-12-01 value 1
2000-12-02 NaN
2000-12-03 value 2
...
Perhaps you can try resampling with a 1-hour frequency.
The df without NaNs (just after df = ds.to_dataframe()):
>>> df
foo
time
2000-01-01 00:00:00 0
2000-01-01 01:00:00 1
2000-01-01 02:00:00 2
2000-01-01 03:00:00 3
2000-01-01 04:00:00 4
... ...
2000-03-16 20:00:00 1820
2000-03-16 21:00:00 1821
2000-03-16 22:00:00 1822
2000-03-16 23:00:00 1823
2000-03-17 00:00:00 1824
[1795 rows x 1 columns]
The df with NaNs (df_1h):
>>> df_1h = df.resample('1H').mean()
>>> df_1h
foo
time
2000-01-01 00:00:00 0.0
2000-01-01 01:00:00 1.0
2000-01-01 02:00:00 2.0
2000-01-01 03:00:00 3.0
2000-01-01 04:00:00 4.0
... ...
2000-03-16 20:00:00 1820.0
2000-03-16 21:00:00 1821.0
2000-03-16 22:00:00 1822.0
2000-03-16 23:00:00 1823.0
2000-03-17 00:00:00 1824.0
[1825 rows x 1 columns]
Rows with NaN:
>>> df_1h[df_1h['foo'].isna()]
foo
time
2000-01-02 10:00:00 NaN
2000-01-04 07:00:00 NaN
2000-01-05 06:00:00 NaN
2000-01-09 02:00:00 NaN
2000-01-13 15:00:00 NaN
2000-01-16 16:00:00 NaN
2000-01-18 21:00:00 NaN
2000-01-21 22:00:00 NaN
2000-01-23 19:00:00 NaN
2000-01-24 01:00:00 NaN
2000-01-24 19:00:00 NaN
2000-01-27 12:00:00 NaN
2000-01-27 16:00:00 NaN
2000-01-29 06:00:00 NaN
2000-02-02 01:00:00 NaN
2000-02-06 13:00:00 NaN
2000-02-09 11:00:00 NaN
2000-02-15 12:00:00 NaN
2000-02-15 15:00:00 NaN
2000-02-21 04:00:00 NaN
2000-02-28 05:00:00 NaN
2000-02-28 06:00:00 NaN
2000-03-01 15:00:00 NaN
2000-03-02 18:00:00 NaN
2000-03-04 18:00:00 NaN
2000-03-05 20:00:00 NaN
2000-03-12 08:00:00 NaN
2000-03-13 20:00:00 NaN
2000-03-16 01:00:00 NaN
The number of NaNs in df_1h:
>>> df_1h.isnull().sum()
foo 30
dtype: int64
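Since the records sit on a regular hourly grid apart from the gaps, an alternative to resample is to reindex onto the complete hourly index. A sketch reusing df and ds from the question (it assumes the first and last hours were not among the dropped records, since the range is taken from the observed index):
# full hourly index spanning the observed range; missing hours become NaN
full_index = pd.date_range(df.index.min(), df.index.max(), freq='1H')
df_full = df.reindex(full_index)       # pandas version
ds_full = ds.reindex(time=full_index)  # xarray equivalent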

How to filter rows based on datetime index

I have a dataframe with a datetime index that contains hourly data per day.
I need to keep only the rows that hold the last record of each day
and filter out the rest of the records for that day only,
doing the same for the other days.
COL-A
DATE
2014-01-01 00:56:00 NaN
2014-01-01 01:56:00 NaN
2014-01-01 02:56:00 NaN
2014-01-01 03:56:00 NaN
2014-01-01 04:00:00 NaN
2014-01-01 04:56:00 42.0
2014-01-01 05:56:00 NaN
2014-01-01 06:56:00 NaN
2014-01-01 07:56:00 NaN
2014-01-01 08:56:00 NaN
2014-01-01 09:56:00 NaN
2014-01-01 10:00:00 19.0
2014-01-01 10:56:00 NaN
2014-01-01 11:56:00 NaN
2014-01-01 12:56:00 NaN
2014-01-01 13:56:00 NaN
2014-01-01 14:56:00 NaN
2014-01-01 15:56:00 NaN
2014-01-01 16:00:00 NaN
2014-01-01 16:56:00 36.0
2014-01-01 17:56:00 NaN
2014-01-01 18:56:00 NaN
2014-01-01 19:56:00 NaN
2014-01-01 20:56:00 NaN
2014-01-01 21:56:00 NaN
2014-01-01 22:00:00 NaN
2014-01-01 22:56:00 NaN
2014-01-01 23:56:00 NaN
2014-01-01 23:59:00 41.0
2014-01-02 00:56:00 NaN
...
...
...
I need to keep only the row
2014-01-01 23:59:00 41.0
Try this. Since DATE is the index, pd.Grouper operates on it directly (pd.Grouper(key='DATE') would only work if DATE were a column):
df.groupby(pd.Grouper(freq='D')).last()
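Note that last() labels each group by day and returns the last non-NaN value per column. If the goal is to keep the original row intact, timestamp and all (the 2014-01-01 23:59:00 row above), a sketch using tail instead:
# keep the last physical row of each calendar day, original index untouched
df.groupby(df.index.normalize()).tail(1)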

Concatenate all dataframe columns into a single column

I have a dataframe that looks roughly like:
      01/01/19  02/01/19  03/01/19  04/01/19
hour
1.0      27.08     47.73     54.24      10.0
2.0      26.06     49.53     46.09      22.0
...
24.0     12.00     34.00     22.00      40.0
I'd like to reduce its dimension to a single column with a proper date index by concatenating all the columns. Is there a smart pandas way to do it?
Expected result... something like:
01/01/19 00:00:00 27.08
01/01/19 01:00:00 26.06
...
01/01/19 23:00:00 12.00
02/01/19 00:00:00 47.73
02/01/19 01:00:00 49.53
...
02/01/19 23:00:00 34.00
...
You can stack and then fix the index using pd.to_datetime and pd.to_timedelta:
u = df.stack()
u.index = (pd.to_datetime(u.index.get_level_values(1), dayfirst=True)
           + pd.to_timedelta(u.index.get_level_values(0) - 1, unit='h'))
u.sort_index()
2019-01-01 00:00:00 27.08
2019-01-01 01:00:00 26.06
2019-01-01 23:00:00 12.00
2019-01-02 00:00:00 47.73
2019-01-02 01:00:00 49.53
2019-01-02 23:00:00 34.00
2019-01-03 00:00:00 54.24
2019-01-03 01:00:00 46.09
2019-01-03 23:00:00 22.00
2019-01-04 00:00:00 10.00
2019-01-04 01:00:00 22.00
2019-01-04 23:00:00 40.00
dtype: float64
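For reference, a self-contained sketch of the same approach, with the frame rebuilt from the values shown in the question (only hours 1, 2 and 24 are included here):
import pandas as pd

df = pd.DataFrame(
    {'01/01/19': [27.08, 26.06, 12.0],
     '02/01/19': [47.73, 49.53, 34.0],
     '03/01/19': [54.24, 46.09, 22.0],
     '04/01/19': [10.0, 22.0, 40.0]},
    index=pd.Index([1.0, 2.0, 24.0], name='hour'))

u = df.stack()  # MultiIndex of (hour, date)
u.index = (pd.to_datetime(u.index.get_level_values(1), dayfirst=True)
           + pd.to_timedelta(u.index.get_level_values(0) - 1, unit='h'))
print(u.sort_index())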

Sampling dataframe Considering NaN values+Pandas

I have a data frame like the one below, and I want to sample it with '3S'.
There are situations where NaN is present. What I expect is for the data frame to be sampled with '3S', and whenever a 'NaN' is found in between, to stop there and restart the sampling from that index. I tried using dataframe.apply to achieve this, but it gets very complex. Is there a shorter way?
df.sample(n=3)
Code to generate Input:
import numpy as np
import pandas as pd

index = pd.date_range('1/1/2000', periods=13, freq='T')
series = pd.DataFrame(range(13), index=index)
series.iloc[4] = np.nan
series.iloc[10] = np.nan
print(series)
I tried to do the sampling, but after that I have no clue how to proceed.
2015-01-01 00:00:00 0.0
2015-01-01 01:00:00 1.0
2015-01-01 02:00:00 2.0
2015-01-01 03:00:00 2.0
2015-01-01 04:00:00 NaN
2015-01-01 05:00:00 3.0
2015-01-01 06:00:00 4.0
2015-01-01 07:00:00 4.0
2015-01-01 08:00:00 4.0
2015-01-01 09:00:00 NaN
2015-01-01 10:00:00 3.0
2015-01-01 11:00:00 4.0
2015-01-01 12:00:00 4.0
The new data frame should be sampled with '3S', but also take any 'NaN' present into account and restart the sampling where the 'NaN' records are found.
Expected Output:
2015-01-01 02:00:00 2.0 -- Sampling after 3S
2015-01-01 03:00:00 2.0 -- Print because NaN has found in Next
2015-01-01 04:00:00 NaN -- print NaN record
2015-01-01 07:00:00 4.0 -- Sampling after 3S
2015-01-01 08:00:00 4.0 -- Print because NaN has found in Next
2015-01-01 09:00:00 NaN -- print NaN record
2015-01-01 12:00:00 4.0 -- Sampling after 3S
Use:
index = pd.date_range('1/1/2000', periods=13, freq='H')
df = pd.DataFrame({'col': range(13)}, index=index)
df.iloc[4, 0] = np.nan
df.iloc[9, 0] = np.nan
print (df)
col
2000-01-01 00:00:00 0.0
2000-01-01 01:00:00 1.0
2000-01-01 02:00:00 2.0
2000-01-01 03:00:00 3.0
2000-01-01 04:00:00 NaN
2000-01-01 05:00:00 5.0
2000-01-01 06:00:00 6.0
2000-01-01 07:00:00 7.0
2000-01-01 08:00:00 8.0
2000-01-01 09:00:00 NaN
2000-01-01 10:00:00 10.0
2000-01-01 11:00:00 11.0
2000-01-01 12:00:00 12.0
m = df['col'].isna()
s1 = m.ne(m.shift()).cumsum()
t = pd.Timedelta(2, unit='H')
mask = df.index >= df.groupby(s1)['col'].transform(lambda x: x.index[0]) + t
df1 = df[mask | m]
print (df1)
col
2000-01-01 02:00:00 2.0
2000-01-01 03:00:00 3.0
2000-01-01 04:00:00 NaN
2000-01-01 07:00:00 7.0
2000-01-01 08:00:00 8.0
2000-01-01 09:00:00 NaN
2000-01-01 12:00:00 12.0
Explanation:
Create a mask of the missing values with Series.isna.
Create groups of consecutive values by comparing the mask against its shifted self with Series.ne (!=) and taking the cumulative sum:
print (s1)
2000-01-01 00:00:00 1
2000-01-01 01:00:00 1
2000-01-01 02:00:00 1
2000-01-01 03:00:00 1
2000-01-01 04:00:00 2
2000-01-01 05:00:00 3
2000-01-01 06:00:00 3
2000-01-01 07:00:00 3
2000-01-01 08:00:00 3
2000-01-01 09:00:00 4
2000-01-01 10:00:00 5
2000-01-01 11:00:00 5
2000-01-01 12:00:00 5
Freq: H, Name: col, dtype: int32
Get the first index value per group, add the timedelta (2 hours, to match the expected output), and compare against the DatetimeIndex.
Finally, filter by boolean indexing, chaining the two masks with | (bitwise OR).
One way would be to fill the NaNs with 0:
df['Col_of_Interest'] = df['Col_of_Interest'].fillna(0)
and then resample the series (if datetime is your index):
series.resample('30S').asfreq()
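A minimal runnable sketch of that suggestion (the column name Col_of_Interest comes from the answer above; the 10-second input frequency is an assumption for illustration):
import numpy as np
import pandas as pd

idx = pd.date_range('2015-01-01', periods=7, freq='10S')
df = pd.DataFrame({'Col_of_Interest': [1.0, np.nan, 3.0, 4.0, np.nan, 6.0, 7.0]}, index=idx)

df['Col_of_Interest'] = df['Col_of_Interest'].fillna(0)
print(df['Col_of_Interest'].resample('30S').asfreq())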

How to resample with pandas in a conservative way?

I'm trying to use a pandas DataFrame/Series to store measurements. These are used as inputs for a thermodynamic simulation, so when resampling, the sum of heat (energy) and the temperature values should be preserved.
import pandas as pd
import numpy as np

start_date = '2015-01-01'
my_timing = '60T'
my_ts_index = pd.date_range(start_date, periods=6, freq=my_timing, name='Time')

# read from a csv or sql database
Heat = [120, 200, 210, 0, 50, 180]
Temperature = [23.2, 19.1, 15.3, 25.2, 20.1, 0.0]

data = np.array([Heat, Temperature])
columnNames = ['Heat', 'Temperature']
my_ts = pd.DataFrame(np.transpose(data), columns=columnNames, index=my_ts_index)
print(my_ts)
# GUI checks if len(my_ts) % new_timestep == 0
# increase frequency
tmp = my_ts.resample('25T').mean()
tmp['Heat'] = tmp['Heat'].fillna(0)
tmp['Temperature'] = tmp['Temperature'].ffill()
print(tmp)

# reduce frequency
tmp = tmp.resample('60T').agg({'Heat': np.sum, 'Temperature': np.average})
print(tmp)

tmp = my_ts.resample('75T').agg({'Heat': np.sum, 'Temperature': np.average})
print(tmp)
Output:
Heat Temperature
Time
2015-01-01 00:00:00 120.0 23.2
2015-01-01 01:00:00 200.0 19.1
2015-01-01 02:00:00 210.0 15.3
2015-01-01 03:00:00 0.0 25.2
2015-01-01 04:00:00 50.0 20.1
2015-01-01 05:00:00 180.0 0.0
Heat Temperature
Time
2015-01-01 00:00:00 120.0 23.2
2015-01-01 00:25:00 0.0 23.2
2015-01-01 00:50:00 200.0 19.1
2015-01-01 01:15:00 0.0 19.1
2015-01-01 01:40:00 210.0 15.3
2015-01-01 02:05:00 0.0 15.3
2015-01-01 02:30:00 0.0 15.3
2015-01-01 02:55:00 0.0 25.2
2015-01-01 03:20:00 0.0 25.2
2015-01-01 03:45:00 50.0 20.1
2015-01-01 04:10:00 0.0 20.1
2015-01-01 04:35:00 0.0 20.1
2015-01-01 05:00:00 180.0 0.0
Heat Temperature
Time
2015-01-01 00:00:00 320.0 21.833333
2015-01-01 01:00:00 210.0 17.200000
2015-01-01 02:00:00 0.0 18.600000
2015-01-01 03:00:00 50.0 22.650000
2015-01-01 04:00:00 0.0 20.100000
2015-01-01 05:00:00 180.0 0.000000
Heat Temperature
Time
2015-01-01 00:00:00 320.0 21.15
2015-01-01 01:15:00 210.0 15.30
2015-01-01 02:30:00 0.0 25.20
2015-01-01 03:45:00 50.0 20.10
2015-01-01 05:00:00 180.0 0.00
Which operations should be used to resample the data to a higher frequency and back to the original without losing anything?
(See Outputs 1 and 3: the conservative properties of energy and temperature in Output 2 still have to hold.)
How can the mean/average/sum be set to start at the 2nd value, so the first stays untouched for initial-value reasons? In other words: how do I assign the mean to the next rather than the previous hour?
(See the first value pair of Output 4.) A sketch addressing both points follows.
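For the first question, one conservative round trip is to split each value evenly when upsampling and sum it back when downsampling; a sketch, assuming the new frequency divides the old one evenly (60T to 30T and back; it reuses pd, np, start_date and my_ts from above):
ratio = 2  # 60T / 30T
up_index = pd.date_range(start_date, periods=len(my_ts) * ratio, freq='30T', name='Time')
up = pd.DataFrame(
    {'Heat': np.repeat(my_ts['Heat'].values, ratio) / ratio,         # split so the sum is conserved
     'Temperature': np.repeat(my_ts['Temperature'].values, ratio)},  # hold temperature constant
    index=up_index)

# downsampling recovers the original frame exactly: per-hour sums of Heat
# and means of Temperature equal the input values
down = up.resample('60T').agg({'Heat': 'sum', 'Temperature': 'mean'})
For the second question, resample accepts label='right' and closed='right', which assign each bin's aggregate to its right edge rather than its left; that may be closer to assigning the mean to the next hour.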
