Access neighbour rows in a pandas DataFrame

I'm trying to calculate local maxima and minima for a series of data: if the current row's value is greater than or less than both the preceding and following rows' values, set it to the current value, else set it to NaN. Is there a more elegant way to do this than the following:
import pandas as pd
import numpy as np
rng = pd.date_range('1/1/2014', periods=10, freq='5min')
s = pd.Series([1, 2, 3, 2, 1, 2, 3, 5, 7, 4], index=rng)
df = pd.DataFrame(s, columns=['val'])
df.index.name = "dt"
df['minmax'] = np.nan
for i in range(1, len(df.index) - 1):  # the first and last rows have no two neighbours
    if df['val'].iloc[i] >= df['val'].iloc[i - 1] and df['val'].iloc[i] >= df['val'].iloc[i + 1]:
        df.loc[df.index[i], 'minmax'] = df['val'].iloc[i]
    elif df['val'].iloc[i] <= df['val'].iloc[i - 1] and df['val'].iloc[i] <= df['val'].iloc[i + 1]:
        df.loc[df.index[i], 'minmax'] = df['val'].iloc[i]
print(df)
Result is:
val minmax
dt
2014-01-01 00:00:00 1 NaN
2014-01-01 00:05:00 2 NaN
2014-01-01 00:10:00 3 3
2014-01-01 00:15:00 2 NaN
2014-01-01 00:20:00 1 1
2014-01-01 00:25:00 2 NaN
2014-01-01 00:30:00 3 NaN
2014-01-01 00:35:00 5 NaN
2014-01-01 00:40:00 7 7
2014-01-01 00:45:00 4 NaN

We can use shift and where to determine which values to assign; importantly, we have to use the bitwise operators & and | when comparing Series. shift returns a Series or DataFrame shifted by 1 row (the default) or by the passed number of rows.
When using where, we pass a boolean condition, and the second parameter (np.nan) is the value assigned wherever the condition is False.
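As a quick illustration of what shift produces (a minimal standalone sketch, not part of the original answer):
import pandas as pd

s = pd.Series([1, 2, 3])
print(s.shift(1).tolist())   # [nan, 1.0, 2.0] -- each row sees its predecessor
print(s.shift(-1).tolist())  # [2.0, 3.0, nan] -- each row sees its successor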
In [81]:
df['minmax'] = df['val'].where(
    ((df['val'] < df['val'].shift(1)) & (df['val'] < df['val'].shift(-1))) |
    ((df['val'] > df['val'].shift(1)) & (df['val'] > df['val'].shift(-1))),
    np.nan)
df
Out[81]:
val minmax
dt
2014-01-01 00:00:00 1 NaN
2014-01-01 00:05:00 2 NaN
2014-01-01 00:10:00 3 3
2014-01-01 00:15:00 2 NaN
2014-01-01 00:20:00 1 1
2014-01-01 00:25:00 2 NaN
2014-01-01 00:30:00 3 NaN
2014-01-01 00:35:00 5 NaN
2014-01-01 00:40:00 7 7
2014-01-01 00:45:00 4 NaN

Related

How to use groupby() with between_time()?

I have a DataFrame and want to multiply all values in column a for a given day by the value of a at 06:00:00 of that day. If there is no 06:00:00 entry, that day should stay unchanged.
The code below unfortunately raises an error.
How do I correct this code, or replace it with a working solution?
import pandas as pd
import numpy as np

start = pd.Timestamp('2000-01-01')
end = pd.Timestamp('2000-01-03')
t = np.linspace(start.value, end.value, 9)
datetime1 = pd.to_datetime(t)
df = pd.DataFrame({'a': [1, 3, 4, 5, 6, 7, 8, 9, 14]})
df['date'] = datetime1
print(df)

def myF(x):
    y = x.set_index('date').between_time('05:59', '06:01').a
    return y

toMultiplyWith = df.groupby(df.date.dt.floor('D')).transform(myF)
a date
0 1 2000-01-01 00:00:00
1 3 2000-01-01 06:00:00
2 4 2000-01-01 12:00:00
3 5 2000-01-01 18:00:00
4 6 2000-01-02 00:00:00
5 7 2000-01-02 06:00:00
6 8 2000-01-02 12:00:00
7 9 2000-01-02 18:00:00
8 14 2000-01-03 00:00:00
...
AttributeError: ("'Series' object has no attribute 'set_index'", 'occurred at index a')
You should change this line:
toMultiplyWith = df.groupby(df.date.dt.floor('D')).transform(myF)
to this:
toMultiplyWith = df.groupby(df.date.dt.floor('D')).apply(myF)
Using .apply instead of .transform will give you the desired result.
apply is the right choice here, since it implicitly passes all the columns of each group as a DataFrame to the custom function.
To read more about the difference between the two methods, consider this answer.
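As a small illustration of the difference (a sketch; the frame and lambdas here are made up for demonstration):
import pandas as pd

demo = pd.DataFrame({'g': ['x', 'x', 'y'], 'a': [1, 2, 3]})
# transform calls the function once per column of each group, passing a Series --
# which is exactly why set_index('date') failed inside myF:
demo.groupby('g')['a'].transform(lambda s: print(type(s).__name__) or s)  # prints "Series"
# apply passes each group as a whole DataFrame:
demo.groupby('g').apply(lambda grp: print(type(grp).__name__) or grp)     # prints "DataFrame"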
If you want to stick with the between_time(...) function, this would be the way to do it:
df = df.set_index('date')
mask = df.between_time('05:59', '06:01').index
df.loc[mask, 'a'] = df.loc[mask, 'a'] ** 2 # the operation you want to perform
df.reset_index(inplace=True)
Outputs:
date a
0 2000-01-01 00:00:00 1
1 2000-01-01 06:00:00 9
2 2000-01-01 12:00:00 4
3 2000-01-01 18:00:00 5
4 2000-01-02 00:00:00 6
5 2000-01-02 06:00:00 49
6 2000-01-02 12:00:00 8
7 2000-01-02 18:00:00 9
8 2000-01-03 00:00:00 14
If I got your goal right, you can use apply to return a dataframe with the same amount of rows as the original dataframe (simulating a transform):
def myF(grp):
    time = grp.date.dt.strftime('%T')      # format timestamps as HH:MM:SS
    target_idx = time == '06:00:00'
    if target_idx.any():
        grp.loc[~target_idx, 'a_sum'] = grp.loc[~target_idx, 'a'].values * grp.loc[target_idx, 'a'].values
    else:
        grp.loc[~target_idx, 'a_sum'] = np.nan
    return grp

df.groupby(df.date.dt.floor('D')).apply(myF)
Output:
a date a_sum
0 1 2000-01-01 00:00:00 3.0
1 3 2000-01-01 06:00:00 NaN
2 4 2000-01-01 12:00:00 12.0
3 5 2000-01-01 18:00:00 15.0
4 6 2000-01-02 00:00:00 42.0
5 7 2000-01-02 06:00:00 NaN
6 8 2000-01-02 12:00:00 56.0
7 9 2000-01-02 18:00:00 63.0
8 14 2000-01-03 00:00:00 NaN
See that, for each day, each value with a time other than 06:00:00 is multiplied by the value whose time equals 06:00:00. It returns NaN for the 06:00:00 values themselves, as well as for groups without this time.

Group pandas rows into pairs then find timedelta

I have a dataframe where I need to group the TX/RX column into pairs, and then put these into a new dataframe with a new index and the timedelta between them as values.
import pandas as pd

ids = range(1, 7)        # inferred from the table below
vals = ['A', 'B'] * 3    # inferred from the table below

df = pd.DataFrame()
df['time1'] = pd.date_range('2018-01-01', periods=6, freq='H')
df['time2'] = pd.date_range('2018-01-01', periods=6, freq='1H1min')
df['id'] = ids
df['val'] = vals
time1 time2 id val
0 2018-01-01 00:00:00 2018-01-01 00:00:00 1 A
1 2018-01-01 01:00:00 2018-01-01 01:01:00 2 B
2 2018-01-01 02:00:00 2018-01-01 02:02:00 3 A
3 2018-01-01 03:00:00 2018-01-01 03:03:00 4 B
4 2018-01-01 04:00:00 2018-01-01 04:04:00 5 A
5 2018-01-01 05:00:00 2018-01-01 05:05:00 6 B
needs to be...
index timedelta A B
0 1 1 2
1 1 3 4
2 1 5 6
I think that pivot_table or stack/unstack is probably the best way to go about this, but I'm not entirely sure how...
I believe you need:
df = pd.DataFrame()
df['time1'] = pd.date_range('2018-01-01', periods=6, freq='H')
df['time2'] = df['time1'] + pd.to_timedelta([60, 60, 120, 120, 180, 180], unit='s')
df['id'] = range(1, 7)
df['val'] = ['A', 'B'] * 3
df['t'] = df['time2'] - df['time1']
print(df)
time1 time2 id val t
0 2018-01-01 00:00:00 2018-01-01 00:01:00 1 A 00:01:00
1 2018-01-01 01:00:00 2018-01-01 01:01:00 2 B 00:01:00
2 2018-01-01 02:00:00 2018-01-01 02:02:00 3 A 00:02:00
3 2018-01-01 03:00:00 2018-01-01 03:02:00 4 B 00:02:00
4 2018-01-01 04:00:00 2018-01-01 04:03:00 5 A 00:03:00
5 2018-01-01 05:00:00 2018-01-01 05:03:00 6 B 00:03:00
#if necessary convert to seconds
#df['t'] = (df['time2'] - df['time1']).dt.total_seconds()
df = df.pivot(index='t', columns='val', values='id').reset_index().rename_axis(None, axis=1)
#if necessary aggregate values
#df = (df.pivot_table(index='t', columns='val', values='id', aggfunc='mean')
#      .reset_index().rename_axis(None, axis=1))
print(df)
t A B
0 00:01:00 1 2
1 00:02:00 3 4
2 00:03:00 5 6
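If the timedelta is handier as a plain number, the conversion from the comment above can also be applied after the pivot (a small follow-on sketch):
df['t'] = df['t'].dt.total_seconds()
print(df)
       t  A  B
0   60.0  1  2
1  120.0  3  4
2  180.0  5  6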

pandas dataframe new column which checks previous day

I have a DataFrame with a DatetimeIndex and a column named Holiday, which is a flag of 1 or 0.
If the index timestamp falls on a holiday, the Holiday column holds 1, otherwise 0.
I need a new column that says whether a given timestamp falls on the first day after a holiday. The new column should simply check whether the previous day has the Holiday flag set to 1, and if so set its own flag to 1, otherwise 0.
EDIT
Doing:
df['DayAfter'] = df.Holiday.shift(1).fillna(0)
Has the Output:
Holiday DayAfter AnyNumber
Datum
...
2014-01-01 20:00:00 1 1.0 9
2014-01-01 20:30:00 1 1.0 2
2014-01-01 21:00:00 1 1.0 3
2014-01-01 21:30:00 1 1.0 3
2014-01-01 22:00:00 1 1.0 6
2014-01-01 22:30:00 1 1.0 1
2014-01-01 23:00:00 1 1.0 1
2014-01-01 23:30:00 1 1.0 1
2014-01-02 00:00:00 0 1.0 1
2014-01-02 00:30:00 0 0.0 2
2014-01-02 01:00:00 0 0.0 1
2014-01-02 01:30:00 0 0.0 1
...
If you check the first timestamp of 2014-01-02, the DayAfter flag is set correctly, but the other flags are 0. That's wrong.
Create an array of unique days that are holidays and offset them by one day
days = pd.Series(df[df.Holiday == 1].index).add(pd.DateOffset(1)).dt.date.unique()
Create a new column with the one day holiday offsets (days)
df['DayAfter'] = np.where(pd.Series(df.index).dt.date.isin(days), 1, 0)
Holiday AnyNumber DayAfter
Datum
2014-01-01 20:00:00 1 9 0
2014-01-01 20:30:00 1 2 0
2014-01-01 21:00:00 1 3 0
2014-01-01 21:30:00 1 3 0
2014-01-01 22:00:00 1 6 0
2014-01-01 22:30:00 1 1 0
2014-01-01 23:00:00 1 1 0
2014-01-01 23:30:00 1 1 0
2014-01-02 00:00:00 0 1 1
2014-01-02 00:30:00 0 2 1
2014-01-02 01:00:00 0 1 1
2014-01-02 01:30:00 0 1 1
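Put together on a small self-contained frame (a minimal sketch; the Datum index and Holiday column follow the question):
import pandas as pd
import numpy as np

idx = pd.date_range('2014-01-01 23:00', periods=4, freq='30min')
df = pd.DataFrame({'Holiday': [1, 1, 0, 0]}, index=idx)
df.index.name = 'Datum'

# unique calendar days containing a holiday, shifted forward one day
days = pd.Series(df[df.Holiday == 1].index).add(pd.DateOffset(1)).dt.date.unique()
# flag every row whose calendar day is in the offset set
df['DayAfter'] = np.where(pd.Series(df.index).dt.date.isin(days), 1, 0)
print(df)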

Return first matching value/column name in new dataframe

import pandas as pd
import numpy as np

rng = pd.date_range('1/1/2011', periods=6, freq='H')
df = pd.DataFrame({'A': [0, 1, 2, 3, 4, 5],
                   'B': [0, 1, 2, 3, 4, 5],
                   'C': [0, 1, 2, 3, 4, 5],
                   'D': [0, 1, 2, 3, 4, 5],
                   'E': [1, 2, 3, 3, 7, 6],
                   'F': [1, 1, 3, 3, 7, 6],
                   'G': [0, 0, 1, 0, 0, 0]},
                  index=rng)
A simple dataframe to help me explain:
df
A B C D E F G
2011-01-01 00:00:00 0 0 0 0 1 1 0
2011-01-01 01:00:00 1 1 1 1 2 1 0
2011-01-01 02:00:00 2 2 2 2 3 3 1
2011-01-01 03:00:00 3 3 3 3 3 3 0
2011-01-01 04:00:00 4 4 4 4 7 7 0
2011-01-01 05:00:00 5 5 5 5 6 6 0
When I filter for values greater than or equal to 2, I get the following output:
df[df >= 2]
A B C D E F G
2011-01-01 00:00:00 NaN NaN NaN NaN NaN NaN NaN
2011-01-01 01:00:00 NaN NaN NaN NaN 2.0 NaN NaN
2011-01-01 02:00:00 2.0 2.0 2.0 2.0 3.0 3.0 NaN
2011-01-01 03:00:00 3.0 3.0 3.0 3.0 3.0 3.0 NaN
2011-01-01 04:00:00 4.0 4.0 4.0 4.0 7.0 7.0 NaN
2011-01-01 05:00:00 5.0 5.0 5.0 5.0 6.0 6.0 NaN
For each row I want to know which column has the first matching value (working from left to right). So on the row for 2011-01-01 01:00:00 it was column E and the value was 2.0.
Desired output:
What I would like to get is a new dataframe with the first matching value in a column named 'Value' and another column called 'From Col' which captures the column name it came from.
If no match is seen, take the output from the last column (G in this case). Thanks for any help.
"Value" "From Col"
2011-01-01 00:00:00 NaN G
2011-01-01 01:00:00 2 E
2011-01-01 02:00:00 2 A
2011-01-01 03:00:00 3 A
2011-01-01 04:00:00 4 A
2011-01-01 05:00:00 5 A
Try this:
def get_first_valid(ser):
    if len(ser) == 0:
        return pd.Series([np.nan, np.nan])
    mask = pd.isnull(ser.values)
    i = mask.argmin()          # position of the first non-null value
    if mask[i]:
        # every value is null -> no match; report the last column name
        return pd.Series([np.nan, ser.index[-1]])
    else:
        return pd.Series([ser[i], ser.index[i]])
In [113]: df[df >= 2].apply(get_first_valid, axis=1)
Out[113]:
0 1
2011-01-01 00:00:00 NaN G
2011-01-01 01:00:00 2.0 E
2011-01-01 02:00:00 2.0 A
2011-01-01 03:00:00 3.0 A
2011-01-01 04:00:00 4.0 A
2011-01-01 05:00:00 5.0 A
or:
In [114]: df[df >= 2].T.apply(get_first_valid).T
Out[114]:
0 1
2011-01-01 00:00:00 NaN G
2011-01-01 01:00:00 2 E
2011-01-01 02:00:00 2 A
2011-01-01 03:00:00 3 A
2011-01-01 04:00:00 4 A
2011-01-01 05:00:00 5 A
PS: I took the source code of the Series.first_valid_index() function and made a dirty hack out of it...
Explanation:
In [221]: ser = pd.Series([np.nan, np.nan, 5, 7, np.nan])
In [222]: ser
Out[222]:
0 NaN
1 NaN
2 5.0
3 7.0
4 NaN
dtype: float64
In [223]: mask = pd.isnull(ser.values)
In [224]: mask
Out[224]: array([ True, True, False, False, True], dtype=bool)
In [225]: i = mask.argmin()
In [226]: i
Out[226]: 2
In [227]: ser.index[i]
Out[227]: 2
In [228]: ser[i]
Out[228]: 5.0
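For comparison, the built-in whose source inspired the hack performs the same lookup directly (continuing the session above; the input numbering is illustrative):
In [229]: ser.first_valid_index()
Out[229]: 2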
Firstly, filter values according to the criterion and drop the rows containing only NaNs. Then use idxmax to return the first occurrence of a True condition; this gives our first series.
To create the second series, iterate over the (index, value) tuple pairs of the first series and look up those locations in the original DF.
ser1 = df[df.ge(2)].dropna(how='all').ge(2).idxmax(axis=1)
ser2 = pd.concat([pd.Series(df.loc[i, r], pd.Index([i])) for i, r in ser1.items()])
Create a new DF whose index pertains to that of the original DF, and fill the missing values in From Col with the name of the last column.
req_df = pd.DataFrame({"From Col": ser1, "Value": ser2}, index=df.index)
req_df['From Col'].fillna(df.columns[-1], inplace=True)
req_df
I don't work with pandas, so consider this just a footnote, but in pure Python there is also the possibility of finding the first non-None index using reduce (imported from functools in Python 3).
>>> from functools import reduce
>>> a = [None, None, None, None, 6, None, None, None, 3, None]
>>> print(reduce(lambda x, y: (x or y[1] and y[0]), enumerate(a), None))
4

Unable to convert to datetime using pd.to_datetime

I am trying to read a csv file and convert it to a dataframe to be used as a time series.
The csv file is of this type:
#Date Time CO_T1_AHU.01_CC_CTRV_CHW__SIG_STAT
0 NaN NaN %
1 NaN NaN Cooling Coil Hydronic Valve Position
2 2014-01-01 00:00:00 0
3 2014-01-01 01:00:00 0
4 2014-01-01 02:00:00 0
5 2014-01-01 03:00:00 0
6 2014-01-01 04:00:00 0
I read the file using:
df = pd.read_csv('filepath/file.csv', sep=';', parse_dates=[[0, 1]])
producing this result:
#Date_Time FCO_T1_AHU.01_CC_CTRV_CHW__SIG_STAT
0 nan nan %
1 nan nan Cooling Coil Hydronic Valve Position
2 2014-01-01 00:00:00 0
3 2014-01-01 01:00:00 0
4 2014-01-01 02:00:00 0
5 2014-01-01 03:00:00 0
6 2014-01-01 04:00:00 0
To continue, I convert the strings to datetime and use the column as the index:
pd.to_datetime(df.values[:,0])
df.set_index([df.columns[0]], inplace=True)
so i get this:
FCO_T1_AHU.01_CC_CTRV_CHW__SIG_STAT
#Date_Time
nan nan %
nan nan Cooling Coil Hydronic Valve Position
2014-01-01 00:00:00 0
2014-01-01 01:00:00 0
2014-01-01 02:00:00 0
2014-01-01 03:00:00 0
2014-01-01 04:00:00 0
However, pd.to_datetime is unable to convert the values to datetime. Is there a way of finding out what the error is?
Many thanks.
Luis
The string entry 'nan nan' cannot be converted by to_datetime, so replace these entries with an empty string so that they convert to NaT:
In [122]:
df['Date_Time'].replace('nan nan', '', inplace=True)
df
Out[122]:
Date_Time index CO_T1_AHU.01_CC_CTRV_CHW__SIG_STAT
0 0 %
1 1 Cooling Coil Hydronic Valve Position
2 2014-01-01 00:00:00 2 0
3 2014-01-01 01:00:00 3 0
4 2014-01-01 02:00:00 4 0
5 2014-01-01 03:00:00 5 0
6 2014-01-01 04:00:00 6 0
In [124]:
df['Date_Time'] = pd.to_datetime(df['Date_Time'])
df
Out[124]:
Date_Time index CO_T1_AHU.01_CC_CTRV_CHW__SIG_STAT
0 NaT 0 %
1 NaT 1 Cooling Coil Hydronic Valve Position
2 2014-01-01 00:00:00 2 0
3 2014-01-01 01:00:00 3 0
4 2014-01-01 02:00:00 4 0
5 2014-01-01 03:00:00 5 0
6 2014-01-01 04:00:00 6 0
UPDATE
Actually, if you just tell to_datetime to coerce the errors then it converts fine (in modern pandas the keyword is errors='coerce'; very old versions spelled it coerce=True):
df['Date_Time'] = pd.to_datetime(df['Date_Time'], errors='coerce')
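A minimal self-contained illustration of the coercion behaviour (a sketch, not from the original answer):
import pandas as pd

s = pd.Series(['nan nan', '2014-01-01 00:00:00'])
print(pd.to_datetime(s, errors='coerce'))
# 0          NaT
# 1   2014-01-01
# dtype: datetime64[ns]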
