How do I find and remove rows of a DataFrame whose values fall in a specific range, for example dates greater than '2017-03-02' and smaller than '2017-03-05'?
import pandas as pd
d_index = pd.date_range('2018-01-01', '2018-01-06')
d_values = pd.date_range('2017-03-01', '2017-03-06')
s = pd.Series(d_values)
s = s.rename('values')
df = pd.DataFrame(s)
df = df.set_index(d_index)
# remove rows with specific values in the 'values' column
In the example above, d_values is ordered from the earliest to the latest date, so in this case slicing the DataFrame by index would do the work. But I am looking for a solution that also works when d_values contains unordered, random date values. Is there any way to do it in pandas?
Option 1
pd.Series.between seems suited for this task.
df[~df['values'].between('2017-03-02', '2017-03-05', inclusive=False)]
values
2018-01-01 2017-03-01
2018-01-02 2017-03-02
2018-01-05 2017-03-05
2018-01-06 2017-03-06
Details
between identifies all items within the range -
m = df['values'].between('2017-03-02', '2017-03-05', inclusive=False)
m
2018-01-01 False
2018-01-02 False
2018-01-03 True
2018-01-04 True
2018-01-05 False
2018-01-06 False
Freq: D, Name: values, dtype: bool
Use the mask to filter on df -
df = df[~m]
Option 2
Alternatively, with the good ol' comparison operators and a negated AND -
df[~(df['values'].gt('2017-03-02') & df['values'].lt('2017-03-05'))]
values
2018-01-01 2017-03-01
2018-01-02 2017-03-02
2018-01-05 2017-03-05
2018-01-06 2017-03-06
Note that both options work with datetime objects as well as string date columns (in which case, the comparison is lexicographic).
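As a side note (not part of the original answer): in recent pandas versions (1.3+) the boolean inclusive argument of Series.between is deprecated in favour of strings, so the exclusive-bounds filter above would be spelled with inclusive='neither'. A minimal self-contained sketch, assuming pandas >= 1.3:
import pandas as pd

d_index = pd.date_range('2018-01-01', '2018-01-06')
df = pd.DataFrame({'values': pd.date_range('2017-03-01', '2017-03-06')}, index=d_index)

# keep rows strictly outside the open interval ('2017-03-02', '2017-03-05')
m = df['values'].between('2017-03-02', '2017-03-05', inclusive='neither')
print(df[~m])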
First, let's shuffle your DF:
In [65]: df = df.sample(frac=1)
In [66]: df
Out[66]:
values
2018-01-03 2017-03-03
2018-01-04 2017-03-04
2018-01-01 2017-03-01
2018-01-06 2017-03-06
2018-01-05 2017-03-05
2018-01-02 2017-03-02
You can use the DataFrame.eval method (thanks @cᴏʟᴅsᴘᴇᴇᴅ for the correction!):
In [70]: df[~df.eval("'2017-03-02' < values < '2017-03-05'")]
Out[70]:
values
2018-01-01 2017-03-01
2018-01-06 2017-03-06
2018-01-05 2017-03-05
2018-01-02 2017-03-02
or DataFrame.query():
In [300]: df.query("not ('2017-03-02' < values < '2017-03-05')")
Out[300]:
values
2018-01-01 2017-03-01
2018-01-06 2017-03-06
2018-01-05 2017-03-05
2018-01-02 2017-03-02
Related
For example, if I resample the data below using sum over 1 day, I get the expected result (5 data points):
idx = pd.date_range('2018-01-01', periods=100, freq='H')
ts = pd.Series(range(len(idx)), index=idx)
data_sum= ts.resample('1d').agg(['sum'])
But I get 100 data points for cumsum, even though I resampled over 1 day using the same approach.
data_cumsum= ts.resample('1d').agg(['cumsum'])
Isn't it supposed to return only 5 data points? Why is cumsum behaving differently from the other aggregations?
The answer is simple - most functions, like sum and mean, aggregate the data, but some, like cumsum, diff, ffill and bfill, do not.
That is the reason for the difference in resample, and also in groupby.
Here it is possible to use Resampler.transform - it broadcasts the resampled values back to the original rows, which is why you get 100 rows. A cumulative sum is not implemented on the resampler, so use the alternative with Grouper and GroupBy.cumsum:
data_sum= ts.resample('1d').transform('sum')
data_cumsum= ts.groupby(pd.Grouper(freq='1d')).cumsum()
print (data_sum)
2018-01-01 00:00:00 276
2018-01-01 01:00:00 276
2018-01-01 02:00:00 276
2018-01-01 03:00:00 276
2018-01-01 04:00:00 276
...
2018-01-04 23:00:00 2004
2018-01-05 00:00:00 390
2018-01-05 01:00:00 390
2018-01-05 02:00:00 390
2018-01-05 03:00:00 390
Freq: H, Length: 100, dtype: int64
print (data_cumsum)
2018-01-01 00:00:00 0
2018-01-01 01:00:00 1
2018-01-01 02:00:00 3
2018-01-01 03:00:00 6
2018-01-01 04:00:00 10
...
2018-01-04 23:00:00 2004
2018-01-05 00:00:00 96
2018-01-05 01:00:00 193
2018-01-05 02:00:00 291
2018-01-05 03:00:00 390
Freq: H, Length: 100, dtype: int64
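As a side note (a sketch, not from the original answer): if the goal really was five data points, i.e. one cumulative value per day, you can aggregate first and then take the cumulative sum of the daily totals. Note this is a running total across days, not the within-day cumulative sum shown above:
# 5 rows: daily sums first, then a running total over those sums
daily_running_total = ts.resample('1d').sum().cumsum()
print(daily_running_total)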
(not a duplicate question)
I have the following datasets:
GMT TIME, Value
2018-01-01 00:00:00, 1.2030
2018-01-01 00:01:00, 1.2000
2018-01-01 00:02:00, 1.2030
2018-01-01 00:03:00, 1.2030
.... , ....
2018-12-31 23:59:59, 1.2030
I am trying to find a way to remove the following:
the hh:mm:ss part from the datetime
After removing the time (hh:mm:ss) section, we will have duplicate date entries, e.g. multiple 2018-01-01 rows, so I need to remove the duplicates and keep only the last row for each date before the next date begins: the last 2018-01-01 before 2018-01-02, the last 2018-01-02 before 2018-01-03, and so on.
How can I do it with Pandas?
Suppose you have data:
GMT TIME Value
0 2018-01-01 00:00:00 1.203
1 2018-01-01 00:01:00 1.200
2 2018-01-01 00:02:00 1.203
3 2018-01-01 00:03:00 1.203
4 2018-01-02 00:03:00 1.203
5 2018-01-03 00:03:00 1.203
6 2018-01-04 00:03:00 1.203
7 2018-12-31 23:59:59 1.203
Use pandas.to_datetime with Series.dt.date and pandas.DataFrame.groupby:
import pandas as pd
df['GMT TIME'] = pd.to_datetime(df['GMT TIME']).dt.date
df.groupby(df['GMT TIME']).last()
Output:
Value
GMT TIME
2018-01-01 1.203
2018-01-02 1.203
2018-01-03 1.203
2018-01-04 1.203
2018-12-31 1.203
Or use pandas.DataFrame.drop_duplicates:
df['GMT TIME'] = pd.to_datetime(df['GMT TIME']).dt.date
df.drop_duplicates('GMT TIME', keep='last')
Output:
GMT TIME Value
3 2018-01-01 1.203
4 2018-01-02 1.203
5 2018-01-03 1.203
6 2018-01-04 1.203
7 2018-12-31 1.203
Using duplicated
# 'GMT TIME' must still be a datetime column here, e.g. df['GMT TIME'] = pd.to_datetime(df['GMT TIME'])
df[~df['GMT TIME'].dt.date.iloc[::-1].duplicated()]
Or using
df.groupby(df['GMT TIME'].dt.date).tail(1)
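For reference, a minimal self-contained sketch of the drop_duplicates approach, using a subset of the sample data from the question (column names assumed as shown there):
import pandas as pd

df = pd.DataFrame({
    'GMT TIME': ['2018-01-01 00:00:00', '2018-01-01 00:01:00',
                 '2018-01-01 00:02:00', '2018-01-01 00:03:00',
                 '2018-01-02 00:03:00', '2018-12-31 23:59:59'],
    'Value': [1.2030, 1.2000, 1.2030, 1.2030, 1.2030, 1.2030],
})

# strip the time component, then keep the last row of each calendar date
df['GMT TIME'] = pd.to_datetime(df['GMT TIME']).dt.date
print(df.drop_duplicates('GMT TIME', keep='last'))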
I have a grouped time series with gaps. I want to fill the gaps, respecting the groupings.
date is unique within each id.
The following works, but gives me zeros where I want NaNs:
data.groupby('id').resample('D', on='date').sum()\
.drop('id', axis=1).reset_index()
The following do not work for some reason
data.groupby('id').resample('D', on='date').asfreq()\
.drop('id', axis=1).reset_index()
data.groupby('id').resample('D', on='date').fillna('pad')\
.drop('id', axis=1).reset_index()
I get the following error:
Upsampling from level= or on= selection is not supported, use .set_index(...) to explicitly set index to datetime-like
I've tried pandas.Grouper with set_index (both a MultiIndex and a single index), but either it does not seem to upsample my date column so that I get continuous dates, or it does not respect the id column.
Pandas version is 0.23.
Try it yourself:
from datetime import datetime
import pandas as pd

data = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 2],
    'date': [
        datetime(2018, 1, 1),
        datetime(2018, 1, 5),
        datetime(2018, 1, 10),
        datetime(2018, 1, 1),
        datetime(2018, 1, 5),
        datetime(2018, 1, 10)],
    'value': [100, 110, 90, 50, 40, 60]})
# Works but gives zeros
data.groupby('id').resample('D', on='date').sum()
# Fails
data.groupby('id').resample('D', on='date').asfreq()
data.groupby('id').resample('D', on='date').fillna('pad')
Create a DatetimeIndex and remove the on parameter from resample:
print (data.set_index('date').groupby('id').resample('D').asfreq())
id
id date
1 2018-01-01 1.0
2018-01-02 NaN
2018-01-03 NaN
2018-01-04 NaN
2018-01-05 1.0
2018-01-06 NaN
2018-01-07 NaN
2018-01-08 NaN
2018-01-09 NaN
2018-01-10 1.0
2 2018-01-01 2.0
2018-01-02 NaN
2018-01-03 NaN
2018-01-04 NaN
2018-01-05 2.0
2018-01-06 NaN
2018-01-07 NaN
2018-01-08 NaN
2018-01-09 NaN
2018-01-10 2.0
print (data.set_index('date').groupby('id').resample('D').fillna('pad'))
#alternatives
#print (data.set_index('date').groupby('id').resample('D').ffill())
#print (data.set_index('date').groupby('id').resample('D').pad())
id
id date
1 2018-01-01 1
2018-01-02 1
2018-01-03 1
2018-01-04 1
2018-01-05 1
2018-01-06 1
2018-01-07 1
2018-01-08 1
2018-01-09 1
2018-01-10 1
2 2018-01-01 2
2018-01-02 2
2018-01-03 2
2018-01-04 2
2018-01-05 2
2018-01-06 2
2018-01-07 2
2018-01-08 2
2018-01-09 2
2018-01-10 2
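To get back a flat frame like the original data (the pattern attempted in the question), the grouping column that is repeated in the resampled output can be dropped and the index reset. A sketch based on the code above:
flat = (data.set_index('date')
            .groupby('id')
            .resample('D')
            .asfreq()
            .drop(columns='id', errors='ignore')  # drop the duplicated grouping column if present
            .reset_index())
print(flat.head())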
EDIT:
If you want to use sum and keep missing values, you need the min_count=1 parameter - see sum:
min_count : int, default 0
The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
New in version 0.22.0: Added with the default being 0. This means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.
print (data.groupby('id').resample('D', on='date').sum(min_count=1))
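The effect of min_count is easy to see on a plain all-NaN Series (a quick check, not from the original answer):
import numpy as np
import pandas as pd

s = pd.Series([np.nan])
print(s.sum())             # 0.0 - default min_count=0
print(s.sum(min_count=1))  # nan - fewer than 1 valid value present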
I have data that I wish to group by week.
I have been able to do this using the following:
Data_Frame.groupby([pd.Grouper(freq='W')]).count()
This creates a DataFrame in the form of:
2018-01-07 ...
2018-01-14 ...
2018-01-21 ...
which is great. However, I need it to start at 06:00, so something like:
2018-01-07 06:00:00 ...
2018-01-14 06:00:00 ...
2018-01-21 06:00:00 ...
I am aware that I could shift my data by 6 hours but this seems like a cheat and I'm pretty sure Grouper comes with the functionality to do this (some way of specifying when it should start grouping).
I was hoping someone would know of a good method of doing this.
Many Thanks
edit:
I'm trying to use Python's built-in functionality more, since it often works better and more consistently. I also turn the data itself into a graph with the timestamps as the y column, and I want the timestamps to actually reflect the data, without some method such as shifting everything by 6 hours, grouping it, and then shifting everything back 6 hours to get the right timestamps.
Use double shift:
np.random.seed(456)
idx = pd.date_range(start = '2018-01-07', end = '2018-01-09', freq = '2H')
df = pd.DataFrame({'a':np.random.randint(10, size=25)}, index=idx)
print (df)
a
2018-01-07 00:00:00 5
2018-01-07 02:00:00 9
2018-01-07 04:00:00 4
2018-01-07 06:00:00 5
2018-01-07 08:00:00 7
2018-01-07 10:00:00 1
2018-01-07 12:00:00 8
2018-01-07 14:00:00 3
2018-01-07 16:00:00 5
2018-01-07 18:00:00 2
2018-01-07 20:00:00 4
2018-01-07 22:00:00 2
2018-01-08 00:00:00 2
2018-01-08 02:00:00 8
2018-01-08 04:00:00 4
2018-01-08 06:00:00 8
2018-01-08 08:00:00 5
2018-01-08 10:00:00 6
2018-01-08 12:00:00 0
2018-01-08 14:00:00 9
2018-01-08 16:00:00 8
2018-01-08 18:00:00 2
2018-01-08 20:00:00 3
2018-01-08 22:00:00 6
2018-01-09 00:00:00 7
# freq='D' for an easy check; in the original, use 'W'
# shift the data back 6 hours so the bins cover 06:00-to-06:00 windows, then shift the bin labels forward 6 hours
df1 = df.shift(-6, freq='H').groupby([pd.Grouper(freq='D')]).count().shift(6, freq='H')
print (df1)
a
2018-01-06 06:00:00 3
2018-01-07 06:00:00 12
2018-01-08 06:00:00 10
So to solve this one needs to use the base parameter for Grouper.
However, the caveat is that whatever time period is used for freq (years, months, days, etc.), base will also be in that unit (from what I can tell).
So, as I want to displace the starting position by 6 hours, my freq needs to be in hours rather than weeks (i.e. 1W = 168H).
So the solution I was looking for was
Data_Frame.groupby([pd.Grouper(freq='168H', base = 6)]).count()
This is simple, short, quick and works exactly as I want it to.
Thanks to all the other answers though
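A follow-up note (not from the original answer): in newer pandas versions (1.1+) base is deprecated in favour of offset (and origin), so a roughly equivalent call, assuming the same Data_Frame and pandas >= 1.1, would be:
# weekly bins starting at 06:00, using offset instead of the deprecated base
Data_Frame.groupby([pd.Grouper(freq='168H', offset='6H')]).count()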
I would create another column with the required dates, and group by it:
import pandas as pd
import numpy as np
selected_datetime = pd.date_range(start = '2018-01-07', end = '2018-01-30', freq = '1H')
df = pd.DataFrame(selected_datetime, columns = ['date'])
df['value1'] = np.random.rand(df.shape[0])
# specify the condition for your date, eg. starting from 6am
df['shift1'] = df['date'].apply(lambda x: x.date() if x.hour == 6 else np.nan)
# forward fill the na values to have last date
df['shift1'] = df['shift1'].fillna(method = 'ffill')
# you can groupby on this col
df.groupby('shift1')['value1'].mean()
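One caveat with this sketch: rows before the first 06:00 timestamp have nothing to forward-fill from, so their shift1 stays NaN and they are silently dropped by the groupby (NaN group keys are excluded by default). A quick check on the frame above:
# number of leading rows that fall before the first 06:00 boundary
print(df['shift1'].isna().sum())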
I am currently working on time series in Python 3 and pandas, and I want to produce a summary of the periods of contiguous missing values, but I'm only able to find the indexes of the NaN values...
Sample data:
Valeurs
2018-01-01 00:00:00 1.0
2018-01-01 04:00:00 NaN
2018-01-01 08:00:00 2.0
2018-01-01 12:00:00 NaN
2018-01-01 16:00:00 NaN
2018-01-01 20:00:00 5.0
2018-01-02 00:00:00 6.0
2018-01-02 04:00:00 7.0
2018-01-02 08:00:00 8.0
2018-01-02 12:00:00 9.0
2018-01-02 16:00:00 5.0
2018-01-02 20:00:00 NaN
2018-01-03 00:00:00 NaN
2018-01-03 04:00:00 NaN
2018-01-03 08:00:00 1.0
2018-01-03 12:00:00 2.0
2018-01-03 16:00:00 NaN
Expected results:
Start_Date number of contiguous missing values
2018-01-01 04:00:00 1
2018-01-01 12:00:00 2
2018-01-02 20:00:00 3
2018-01-03 16:00:00 1
How can I manage to obtain this type of result with pandas (shift(), cumsum(), groupby()...)?
Thank you for your advice!
Sylvain
groupby and agg
# boolean mask of the missing values
mask = df.Valeurs.isna()
# group the NaN timestamps by the number of non-NaN values seen before them,
# so each contiguous run of NaNs forms its own group
d = df.index.to_series()[mask].groupby((~mask).cumsum()[mask]).agg(['first', 'size'])
d.rename(columns=dict(size='num of contig null', first='Start_Date')).reset_index(drop=True)
Start_Date num of contig null
0 2018-01-01 04:00:00 1
1 2018-01-01 12:00:00 2
2 2018-01-02 20:00:00 3
3 2018-01-03 16:00:00 1
Working on the underlying numpy array:
import numpy as np

a = df.Valeurs.values
# pad with False so every NaN run has a clear start and end boundary
m = np.concatenate(([False], np.isnan(a), [False]))
# positions where the mask flips: even entries are run starts, odd entries are run ends
idx = np.nonzero(m[1:] != m[:-1])[0]
# timestamps where a NaN run starts (a NaN preceded by a non-NaN value)
out = df[df.Valeurs.isnull() & ~df.Valeurs.shift().isnull()].index
pd.DataFrame({'Start date': out, 'contiguous': idx[1::2] - idx[::2]})
Start date contiguous
0 2018-01-01 04:00:00 1
1 2018-01-01 12:00:00 2
2 2018-01-02 20:00:00 3
3 2018-01-03 16:00:00 1
If you have the indices where the NaN values occur, you can use itertools to find the contiguous chunks, as sketched below.
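A minimal sketch of that itertools idea, assuming the df from the question (datetime index, Valeurs column): consecutive integer positions belong to the same run exactly when position minus enumeration rank is constant, so groupby on that difference yields the contiguous chunks.
import numpy as np
import pandas as pd
from itertools import groupby

# integer positions of the NaN values
nan_pos = np.flatnonzero(df['Valeurs'].isna().values)

# consecutive positions share the same (position - enumeration index)
runs = [[pos for _, pos in grp]
        for _, grp in groupby(enumerate(nan_pos), key=lambda t: t[1] - t[0])]

summary = pd.DataFrame({
    'Start_Date': [df.index[r[0]] for r in runs],
    'number of contiguous missing values': [len(r) for r in runs],
})
print(summary)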