Pandas: Remove duplicate dates but keeping the last

Pandas: Remove duplicate dates but keeping the last - python

(not a duplicate question)
I have the following datasets:
GMT TIME, Value
2018-01-01 00:00:00, 1.2030
2018-01-01 00:01:00, 1.2000
2018-01-01 00:02:00, 1.2030
2018-01-01 00:03:00, 1.2030
.... , ....
2018-12-31 23:59:59, 1.2030
I am trying to find a way to remove the following:
hh:mm:ss form the datetime
After removing the time (hh:mm:ss) section, we will have duplicate date entry like multiple 2018-01-01 and so on... so I need to remove the duplicate date data and only keep the last date, before the next date, eg 2018-01-02 and similarly keep the last 2018-01-02 before the next date 2018-01-03 and repeat...
How can I do it with Pandas?

Suppose you have data:
GMT TIME Value
0 2018-01-01 00:00:00 1.203
1 2018-01-01 00:01:00 1.200
2 2018-01-01 00:02:00 1.203
3 2018-01-01 00:03:00 1.203
4 2018-01-02 00:03:00 1.203
5 2018-01-03 00:03:00 1.203
6 2018-01-04 00:03:00 1.203
7 2018-12-31 23:59:59 1.203
Use pandas.to_datetime.dt.date with pandas.DataFrame.groupby:
import pandas as pd
df['GMT TIME'] = pd.to_datetime(df['GMT TIME']).dt.date
df.groupby(df['GMT TIME']).last()
Output:
Value
GMT TIME
2018-01-01 1.203
2018-01-02 1.203
2018-01-03 1.203
2018-01-04 1.203
2018-12-31 1.203
Or use pandas.DataFrame.drop_duplicates:
df['GMT TIME'] = pd.to_datetime(df['GMT TIME']).dt.date
df.drop_duplicates('GMT TIME', 'last')
Output:
GMT TIME Value
3 2018-01-01 1.203
4 2018-01-02 1.203
5 2018-01-03 1.203
6 2018-01-04 1.203
7 2018-12-31 1.203

Using duplicated
#df['GMT TIME'] = pd.to_datetime(df['GMT TIME']).dt.date
df[~df['GMT TIME'].dt.date.iloc[::-1].duplicated()]\
Or using
df.groupby(df['GMT TIME'].dt.date).tail(1)

Related

How to set a multiindex with multiple dates in pandas?

I have the following dataframe df:
Datetime1 Datetime2 Value
2018-01-01 00:00 2018-01-01 01:00 5
2018-01-01 01:00 2018-01-01 02:00 1
2018-01-01 02:00 2018-01-01 03:00 2
2018-01-01 03:00 2018-01-01 04:00 3
2018-01-01 04:00 2018-01-01 05:00 6
I want to set a multi index composed of Datetime1 and Datetime2 to further proceed with the data resampling and interpolation (from 1 hour to 30 minutes frequency).
If I do df.set_index(["Datetime1","Datetime2"]).resample("30T").ffill(), then it fails.
Desired output:
Datetime1 Datetime2 Value
2018-01-01 00:00 2018-01-01 01:00 5
2018-01-01 00:30 2018-01-01 01:30 5
2018-01-01 01:00 2018-01-01 02:00 1
2018-01-01 01:30 2018-01-01 02:30 1
...

If there is one hour difference is possible create MultiIndex after resample with add 1H to new DatetimeIndex:
df = df.set_index(["Datetime1"])[['Value']].resample("30T").ffill()
df = df.set_index([df.index.rename('Datetime2') + pd.Timedelta('1H')], append=True)
print (df)
Value
Datetime1 Datetime2
2018-01-01 00:00:00 2018-01-01 01:00:00 5
2018-01-01 00:30:00 2018-01-01 01:30:00 5
2018-01-01 01:00:00 2018-01-01 02:00:00 1
2018-01-01 01:30:00 2018-01-01 02:30:00 1
2018-01-01 02:00:00 2018-01-01 03:00:00 2
2018-01-01 02:30:00 2018-01-01 03:30:00 2
2018-01-01 03:00:00 2018-01-01 04:00:00 3
2018-01-01 03:30:00 2018-01-01 04:30:00 3
2018-01-01 04:00:00 2018-01-01 05:00:00 6
Or:
s = df.set_index(["Datetime1"])['Value'].resample("30T").ffill()
s.index = [s.index,s.index.rename('Datetime2') + pd.Timedelta('1H')]
print (s)
Datetime1 Datetime2
2018-01-01 00:00:00 2018-01-01 01:00:00 5
2018-01-01 00:30:00 2018-01-01 01:30:00 5
2018-01-01 01:00:00 2018-01-01 02:00:00 1
2018-01-01 01:30:00 2018-01-01 02:30:00 1
2018-01-01 02:00:00 2018-01-01 03:00:00 2
2018-01-01 02:30:00 2018-01-01 03:30:00 2
2018-01-01 03:00:00 2018-01-01 04:00:00 3
2018-01-01 03:30:00 2018-01-01 04:30:00 3
2018-01-01 04:00:00 2018-01-01 05:00:00 6
Name: Value, dtype: int64

The multi-index is not meant for a double-index but for a hierarchical (grouped) index. See the docs. You said in the comments, that Datetime2 is always offset by 1 hour. That means it's probably fastest to recalculate it:
df.set_index("Datetime1","Datetime2").resample("30T").ffill()
df["Datetime2" = df.index + pd.Timedelta(1, "hour")

Find value cycles in time series data

I have a large time-series > 5 million rows, the values in time series fluctuate randomly between 2-10:
A small section of time-series:
I want to identify a certain pattern from this time series, pattern:
when the value of pct_change is >= threshold " T " I want to raise a flag that says reading begins
if the value of pct_change is >= T or < T and !=0 after reading begins flag has been raised then a reading continue flag should be raised until a zero is encountered
if a zero is encountered then a reading stop flag should be raised if the value of pct_change is < T after this flag has been raised then a not reading flag should be raised.
I want to write a function that can tell me how many times and for what duration this happened.
If we take a threshold T of 4 and use pct_change from the example data screenshot then the output that I want is :
The main goal behind this is to find how many times this cycle is repeating for different thresholds.
To generate sample data :
import pandas as pd
a = [2,3,4,2,0,14,5,6,3,2,0,4,5,7,8,10,4,0,5,6,7,10,7,6,4,2,0,1,2,5,6]
idx = pd.date_range("2018-01-01", periods=len(a), freq="H")
ts = pd.Series(a, index=idx)
dd = pd.DataFrame()
dd['pct_change'] =ts
dd.head()
Can you please suggest an efficient way of doing it?
Output that I want if threshold 'T' is >= 4 :

First, keep only interesting data (>= T | == 0):
threshold = 4
df = dd.loc[dd["pct_change"].ge(threshold) | dd["pct_change"].eq(0)]
>>> df
pct_change
2018-01-01 02:00:00 4 # group 0, end=2018-01-01 04:00:00
2018-01-01 04:00:00 0
2018-01-01 05:00:00 14 # group 1, end=2018-01-01 10:00:00
2018-01-01 06:00:00 5
2018-01-01 07:00:00 6
2018-01-01 10:00:00 0
2018-01-01 11:00:00 4 # group 2, end=2018-01-01 17:00:00
2018-01-01 12:00:00 5
2018-01-01 13:00:00 7
2018-01-01 14:00:00 8
2018-01-01 15:00:00 10
2018-01-01 16:00:00 4
2018-01-01 17:00:00 0
2018-01-01 18:00:00 5 # group 3, end=2018-01-02 02:00:00
2018-01-01 19:00:00 6
2018-01-01 20:00:00 7
2018-01-01 21:00:00 10
2018-01-01 22:00:00 7
2018-01-01 23:00:00 6
2018-01-02 00:00:00 4
2018-01-02 02:00:00 0
2018-01-02 05:00:00 5 # group 4, end=2018-01-02 06:00:00
2018-01-02 06:00:00 6
Then, create wanting groups:
groups = df["pct_change"].eq(0).shift(fill_value=0).cumsum()
>>> groups
2018-01-01 02:00:00 0 # group 0
2018-01-01 04:00:00 0
2018-01-01 05:00:00 1 # group 1
2018-01-01 06:00:00 1
2018-01-01 07:00:00 1
2018-01-01 10:00:00 1
2018-01-01 11:00:00 2 # group 2
2018-01-01 12:00:00 2
2018-01-01 13:00:00 2
2018-01-01 14:00:00 2
2018-01-01 15:00:00 2
2018-01-01 16:00:00 2
2018-01-01 17:00:00 2
2018-01-01 18:00:00 3 # group 3
2018-01-01 19:00:00 3
2018-01-01 20:00:00 3
2018-01-01 21:00:00 3
2018-01-01 22:00:00 3
2018-01-01 23:00:00 3
2018-01-02 00:00:00 3
2018-01-02 02:00:00 3
2018-01-02 05:00:00 4 # group 4
2018-01-02 06:00:00 4
Name: pct_change, dtype: object
Finally, use groups to output result:
out = pd.DataFrame(df.groupby(groups) \
.apply(lambda x: (x.index[0], x.index[-1])) \
.tolist(), columns=["StartTime", "EndTime"])
>>> out
StartTime EndTime
0 2018-01-01 02:00:00 2018-01-01 04:00:00 # group 0
1 2018-01-01 05:00:00 2018-01-01 10:00:00 # group 1
2 2018-01-01 11:00:00 2018-01-01 17:00:00 # group 2
3 2018-01-01 18:00:00 2018-01-02 02:00:00 # group 3
4 2018-01-02 05:00:00 2018-01-02 06:00:00 # group 4
Bonus
There are some case where you have to remove groups:
The first pct value is 0
Two or more consecutive pct value is 0
To remove them:
out = out[~out["StartTime"].eq(out["EndTime"])]

pandas put in a dataframe the daily hour (and only the hour) where there is the max

I have this dataframe:
dates,rr.price,ax.price,be.price
2018-01-01 00:00:00,45.73,45.83,47.63
2018-01-01 01:00:00,44.16,44.59,44.42
2018-01-01 02:00:00,42.24,40.22,42.34
2018-01-01 03:00:00,39.29,37.31,38.36
2018-01-01 04:00:00,36.0,32.88,36.87
2018-01-01 05:00:00,41.99,39.27,39.79
2018-01-01 06:00:00,42.25,43.62,42.08
2018-01-01 07:00:00,44.97,49.69,51.19
2018-01-01 08:00:00,45.0,49.98,59.69
2018-01-01 09:00:00,44.94,48.04,56.67
2018-01-01 10:00:00,45.04,46.85,53.54
2018-01-01 11:00:00,46.67,47.95,52.6
2018-01-01 12:00:00,46.99,46.6,50.77
2018-01-01 13:00:00,44.16,43.02,50.27
2018-01-01 14:00:00,45.26,44.2,50.64
2018-01-01 15:00:00,47.84,47.1,54.79
2018-01-01 16:00:00,50.1,50.83,60.17
2018-01-01 17:00:00,54.3,58.31,59.47
2018-01-01 18:00:00,51.91,63.5,60.16
2018-01-01 19:00:00,51.38,61.9,70.81
2018-01-01 20:00:00,49.2,59.62,62.65
2018-01-01 21:00:00,45.73,52.84,59.71
2018-01-01 22:00:00,44.84,51.43,50.96
2018-01-01 23:00:00,38.11,45.35,46.52
2018-01-02 00:00:00,19.19,41.61,49.62
2018-01-02 01:00:00,14.99,40.78,45.05
2018-01-02 02:00:00,11.0,39.59,45.18
2018-01-02 03:00:00,10.0,36.95,37.12
2018-01-02 04:00:00,11.83,31.38,38.03
2018-01-02 05:00:00,14.99,34.02,46.17
2018-01-02 06:00:00,40.6,41.27,51.71
2018-01-02 07:00:00,46.99,48.25,54.37
2018-01-02 08:00:00,47.95,43.57,75.3
2018-01-02 09:00:00,49.9,48.34,68.48
2018-01-02 10:00:00,50.0,48.01,61.94
2018-01-02 11:00:00,49.7,52.22,63.26
2018-01-02 12:00:00,48.16,47.47,59.41
2018-01-02 13:00:00,47.24,47.61,60.0
2018-01-02 14:00:00,46.1,49.12,67.44
2018-01-02 15:00:00,47.6,52.38,66.82
2018-01-02 16:00:00,50.45,58.35,72.17
2018-01-02 17:00:00,54.9,61.4,70.28
2018-01-02 18:00:00,57.18,54.58,62.63
2018-01-02 19:00:00,54.9,53.66,63.78
2018-01-02 20:00:00,51.2,54.15,63.08
2018-01-02 21:00:00,48.82,48.67,56.42
2018-01-02 22:00:00,45.14,47.46,49.85
2018-01-02 23:00:00,40.09,42.46,43.87
2018-01-03 00:00:00,42.75,34.72,25.51
2018-01-03 01:00:00,35.02,30.31,21.07
2018-01-03 02:00:00,28.85,25.35,16.8
I want to have another dataframe where there is for each day the hour of the day where there is the max value of rr.price,dates,rr.price,be.price.
What I have done so far is this:
im = 1
dfr_im = dfr[dfr.index.month == im]
because I want to do that for each month of my original dataframe where I have an entire year.
After that, I do:
dfr_h = dfr_im.groupby(dfr_im.index.date)['rr.price','ax.price','be.price'].idxmax()
This is result:
,rr.price,ax.price,be.price
2018-01-01,2018-01-01 17:00:00,2018-01-01 18:00:00,2018-01-01 19:00:00
2018-01-02,2018-01-02 18:00:00,2018-01-02 17:00:00,2018-01-02 08:00:00
2018-01-03,2018-01-03 00:00:00,2018-01-03 00:00:00,2018-01-03 00:00:00
However, I would like to have
,rr.price,ax.price,be.price
2018-01-01,17,18,19
2018-01-02,18,17,8
2018-01-03,0,0,0
In addition I would like to not only considering all 24 hours in a day but as additional columns I would like to consider only some hour in order to compute the hour of the day with the max value. For example I would like to consider the hours between [0-8] or [0-8+20-23].
Thanks

you can stack get the hour and unstack
dfr_im.groupby(dfr_im.index.date)[['rr.price','ax.price','be.price']].idxmax().stack().dt.hour.unstack()
You can use between_time and then do the computation above on that slice. If you only want to look at one time frame then:
df_f = dfr_im.between_time('00:00', '08:00')
df_f.groupby(df_f.index.date)[['rr.price','ax.price','be.price']].idxmax().stack().dt.hour.unstack()
Or if you want to look at two times you can use loc with numpy's concatenate function
df_f = dfr_im.loc[np.concatenate([dfr_im.between_time('00:00', '08:00').index,
dfr_im.between_time('20:00', '23:00').index])]
df_f.groupby(df_f.index.date)[['rr.price','ax.price','be.price']].idxmax().stack().dt.hour.unstack()

Counting continuous nan values in panda Time series

I actually work on time series in Python 3 and Pandas and I want to make a synthesis of periods of contiguous missing values but I'm only able to find the indexes of nan values ...
Sample data :
Valeurs
2018-01-01 00:00:00 1.0
2018-01-01 04:00:00 NaN
2018-01-01 08:00:00 2.0
2018-01-01 12:00:00 NaN
2018-01-01 16:00:00 NaN
2018-01-01 20:00:00 5.0
2018-01-02 00:00:00 6.0
2018-01-02 04:00:00 7.0
2018-01-02 08:00:00 8.0
2018-01-02 12:00:00 9.0
2018-01-02 16:00:00 5.0
2018-01-02 20:00:00 NaN
2018-01-03 00:00:00 NaN
2018-01-03 04:00:00 NaN
2018-01-03 08:00:00 1.0
2018-01-03 12:00:00 2.0
2018-01-03 16:00:00 NaN
Expected results :
Start_Date number of contiguous missing values
2018-01-01 04:00:00 1
2018-01-01 12:00:00 2
2018-01-02 20:00:00 3
2018-01-03 16:00:00 1
How can i manage to obtain this type of results with pandas (shift(), cumsum(), groupby() ???)?
Thank you for your advice!
Sylvain

groupby and agg
mask = df.Valeurs.isna()
d = df.index.to_series()[mask].groupby((~mask).cumsum()[mask]).agg(['first', 'size'])
d.rename(columns=dict(size='num of contig null', first='Start_Date')).reset_index(drop=True)
Start_Date num of contig null
0 2018-01-01 04:00:00 1
1 2018-01-01 12:00:00 2
2 2018-01-02 20:00:00 3
3 2018-01-03 16:00:00 1

Working on the underlying numpy array:
a = df.Valeurs.values
m = np.concatenate(([False],np.isnan(a),[False]))
idx = np.nonzero(m[1:] != m[:-1])[0]
out = df[df.Valeurs.isnull() & ~df.Valeurs.shift().isnull()].index
pd.DataFrame({'Start date': out, 'contiguous': (idx[1::2] - idx[::2])})
Start date contiguous
0 2018-01-01 04:00:00 1
1 2018-01-01 12:00:00 2
2 2018-01-02 20:00:00 3
3 2018-01-03 16:00:00 1

If you have the indices where the values occur, you can use itertools as in this to find continuous chunks

Pandas .resample() or .asfreq() fill forward times

I'm trying to resample a dataframe with a time series from 1-hour increments to 15-minute. Both .resample() and .asfreq() do almost exactly what I want, but I'm having a hard time filling the last three intervals.
I could add an extra hour at the end, resample, and then drop that last hour, but it feels hacky.
Current code:
df = pd.DataFrame({'date':pd.date_range('2018-01-01 00:00', '2018-01-01 01:00', freq = '1H'), 'num':5})
df = df.set_index('date').asfreq('15T', method = 'ffill', how = 'end').reset_index()
Current output:
date num
0 2018-01-01 00:00:00 5
1 2018-01-01 00:15:00 5
2 2018-01-01 00:30:00 5
3 2018-01-01 00:45:00 5
4 2018-01-01 01:00:00 5
Desired output:
date num
0 2018-01-01 00:00:00 5
1 2018-01-01 00:15:00 5
2 2018-01-01 00:30:00 5
3 2018-01-01 00:45:00 5
4 2018-01-01 01:00:00 5
5 2018-01-01 01:15:00 5
6 2018-01-01 01:30:00 5
7 2018-01-01 01:45:00 5
Thoughts?

Not sure about asfreq but reindex works wonderfully:
df.set_index('date').reindex(
pd.date_range(
df.date.min(),
df.date.max() + pd.Timedelta('1H'), freq='15T', closed='left'
),
method='ffill'
)
num
2018-01-01 00:00:00 5
2018-01-01 00:15:00 5
2018-01-01 00:30:00 5
2018-01-01 00:45:00 5
2018-01-01 01:00:00 5
2018-01-01 01:15:00 5
2018-01-01 01:30:00 5
2018-01-01 01:45:00 5

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pandas: Remove duplicate dates but keeping the last - python

Using duplicated #df['GMT TIME'] = pd.to_datetime(df['GMT TIME']).dt.date df[~df['GMT TIME'].dt.date.iloc[::-1].duplicated()]\ Or using df.groupby(df['GMT TIME'].dt.date).tail(1)

Related

How to set a multiindex with multiple dates in pandas?

Find value cycles in time series data

pandas put in a dataframe the daily hour (and only the hour) where there is the max

Counting continuous nan values in panda Time series

Pandas .resample() or .asfreq() fill forward times

Categories

Resources