I have a time series of daily rainfall data that looks like this:
PRCP
year_month_day
1797-01-01 00:00:00 0.0
1797-01-02 00:00:00 0.0
1797-01-03 00:00:00 1.1
1797-01-04 00:00:00 0.0
1797-01-05 00:00:00 3.5
1797-02-01 00:00:00 8.1
1797-02-02 00:00:00 3.0
1797-02-03 00:00:00 0.0
1797-02-04 00:00:00 0.0
1797-02-05 00:00:00 0.0
1797-03-01 00:00:00 0.0
1797-03-02 00:00:00 0.0
1797-03-03 00:00:00 0.0
1797-03-04 00:00:00 0.0
1797-03-05 00:00:00 1.5
1797-04-01 00:00:00 6.3
1797-04-02 00:00:00 24.0
1797-04-03 00:00:00 0.0
1797-04-04 00:00:00 2.2
1797-04-05 00:00:00 5.9
1797-05-01 00:00:00 0.0
1797-05-02 00:00:00 15.9
1797-05-03 00:00:00 0.0
1797-05-04 00:00:00 0.0
1797-05-05 00:00:00 0.0
1797-06-01 00:00:00 1.6
1797-06-02 00:00:00 0.0
1797-06-03 00:00:00 0.0
1797-06-04 00:00:00 7.9
1797-06-05 00:00:00 0.0
I have been able to import it with the index column as a pandas datetime object. I am trying to count all of the non-zero raindays per month. I can group by month with:
grouped = df.groupby(pd.Grouper(freq='M'))
and can count everything per month with:
raindays = grouped.resample("M").count()
But that also counts days with 0 rainfall. I found hints about using nunique(), but it doesn't seem to work with resample, e.g.:
raindays = grouped.resample("M").nunique()
returns error:
AttributeError: 'DataFrameGroupBy' object has no attribute 'nunique'
Is there a way to count non-zero values in a grouped pandas object?
Mask those 0s and try again.
df.mask(df.PRCP.eq(0)).groupby(pd.Grouper(freq='M')).count()
Or, the more obvious version with replace.
import numpy as np

df.replace({0: np.nan}).groupby(pd.Grouper(freq='M')).count()
PRCP
year_month_day
1797-01-31 2
1797-02-28 2
1797-03-31 1
1797-04-30 4
1797-05-31 1
1797-06-30 2
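An equivalent one-liner, as a minimal sketch assuming df's index is the parsed DatetimeIndex shown above: compare to zero first, then sum the booleans per month (each True counts as 1).
raindays = df['PRCP'].ne(0).groupby(pd.Grouper(freq='M')).sum()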
Using factorize and bincount
# Map each date to its month end; factorize returns integer codes f and the unique month-ends u
f, u = pd.factorize(df.index + pd.offsets.MonthEnd(0))
# bincount sums the boolean weights (PRCP != 0) per code, i.e. the non-zero days per month
pd.Series(np.bincount(f, df.PRCP.values != 0).astype(int), u)
1797-01-31 2
1797-02-28 2
1797-03-31 1
1797-04-30 4
1797-05-31 1
1797-06-30 2
dtype: int64
Related
I have a dataframe with time data in the format:
date values
0 2013-01-01 00:00:00 0.0
1 2013-01-01 01:00:00 0.0
2 2013-01-01 02:00:00 -9999
3 2013-01-01 03:00:00 -9999
4 2013-01-01 04:00:00 0.0
.. ... ...
8754 2016-12-31 18:00:00 427.5
8755 2016-12-31 19:00:00 194.9
8756 2016-12-31 20:00:00 -9999
8757 2016-12-31 21:00:00 237.6
8758 2016-12-31 22:00:00 -9999
8759 2016-12-31 23:00:00 0.0
And I want that, for every month in which the value -9999 is repeated more than 175 times, those values get changed to NaN.
Imagine that we have this other dataframe with the number of times the value is repeated per month:
date values
0 2013-01 200
1 2013-02 0
2 2013-03 2
3 2013-04 181
4 2013-05 0
5 2013-06 0
6 2013-07 66
7 2013-08 0
8 2013-09 7
In this case, the months of January and April exceed the stipulated value, so that first dataframe should become:
date values
0 2013-01-01 00:00:00 0.0
1 2013-01-01 01:00:00 0.0
2 2013-01-01 02:00:00 NaN
3 2013-01-01 03:00:00 NaN
4 2013-01-01 04:00:00 0.0
.. ... ...
8754 2016-12-31 18:00:00 427.5
8755 2016-12-31 19:00:00 194.9
8756 2016-12-31 20:00:00 -9999
8757 2016-12-31 21:00:00 237.6
8758 2016-12-31 22:00:00 -9999
8759 2016-12-31 23:00:00 0.0
I imagined using tolist() to create a list of the months where the value appears more than 175 times, then applying a condition like df["values"] == -9999 and df["date"] in list_with_months to change the values.
You can do this using a transform call where you calculate the number of values per month in the same dataframe. Then you create a new column conditionally on this:
import numpy as np
MISSING = -9999
THRESHOLD = 175
# Create a month column
df['month'] = df['date'].dt.to_period('M')
# Count number of MISSING per month and assign to dataframe
df['n_missing'] = (
    df.groupby('month')['values']
    .transform(lambda d: (d == MISSING).sum())
)
# If value is MISSING and number of missing is above THRESHOLD, replace with NaN, otherwise keep original values
df['new_value'] = np.where(
    (df['values'] == MISSING) & (df['n_missing'] > THRESHOLD),
    np.nan,
    df['values']
)
I am trying to resample time series data from 5-minute frequency to hourly averages.
df = pd.read_csv("my_data.csv", index_col=False, usecols=['A','B','C'])
output:
A B C
0 16-01-21 0:00 95.75 0.0
1 16-01-21 0:05 90.10 0.0
2 16-01-21 0:10 86.26 0.0
3 16-01-21 0:15 92.72 0.0
4 16-01-21 0:20 81.54 0.0
df.A = pd.to_datetime(df.A)
Output:
A B C
0 2021-01-16 00:00:00 95.75 0.0
1 2021-01-16 00:05:00 90.10 0.0
2 2021-01-16 00:10:00 86.26 0.0
3 2021-01-16 00:15:00 92.72 0.0
4 2021-01-16 00:20:00 81.54 0.0
Now I set the Timestamp column as index,
df.set_index('A', inplace=True)
And when I try to resample with
df2 = df.resample('H').mean()
I am getting this,
B C
A
2021-01-02 00:00:00 79.970278 0.0
2021-01-02 01:00:00 77.951667 0.0
2021-01-02 02:00:00 77.610556 0.0
2021-01-02 03:00:00 80.800000 0.0
2021-01-02 04:00:00 84.305000 0.0
I was expecting this kind of timestamp with the average values for each hour:
A B C
2021-01-16 00:00:00 79.970278 0.0
2021-01-16 01:00:00 77.951667 0.0
2021-01-16 02:00:00 77.610556 0.0
2021-01-16 03:00:00 80.800000 0.0
2021-01-16 04:00:00 84.305000 0.0
I am not sure where I am making a mistake. Help me out.
I think the problem here is that some datetimes are wrongly converted: pd.to_datetime parses month-first by default, so a day-first date is misread, e.g.:
01-02-21 -> 2021-01-02
Possible solutions:
df.A = pd.to_datetime(df.A, dayfirst=True)
Or:
df = pd.read_csv("my_data.csv",
                 index_col=False,
                 usecols=['A','B','C'],
                 parse_dates=['A'],
                 dayfirst=True)
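If you know the exact layout, passing a format is unambiguous and skips the guessing entirely; a minimal sketch, assuming the timestamps really are day-month with a two-digit year:
df.A = pd.to_datetime(df.A, format='%d-%m-%y %H:%M')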
I've got a dataframe called new_dh of web requests that looks like this (there are more columns):
s-sitename sc-win32-status
date_time
2006-11-01 00:00:00 W3SVC1 0.0
2006-11-01 00:00:00 W3SVC1 0.0
2006-11-01 01:00:00 W3SVC1 0.0
2006-11-01 01:00:00 W3SVC1 0.0
2006-11-01 02:00:00 W3SVC1 0.0
2007-02-28 02:00:00 W3SVC1 0.0
2007-02-28 10:00:00 W3SVC1 0.0
2007-02-28 23:00:00 W3SVC1 0.0
2007-02-28 23:00:00 W3SVC1 0.0
2007-02-28 23:00:00 W3SVC1 0.0
What I would like to do is group by the hour of the datetimeindex (the actual date of the request does not matter, just the hour; all the times have already been rounded down, so there are no minutes) and instead return
count
hour
0     2
1     2
2     2
10    1
23    3
Any help would be much appreciated.
I have tried
new_dh.groupby([new_dh.index.hour]).count()
but that prints many columns with the same value, whereas I only want the output above.
If you need a DatetimeIndex in the output, use DataFrame.resample:
new_dh.resample('H')['s-sitename'].count()
Or DatetimeIndex.floor:
new_dh.groupby(new_dh.index.floor('H'))['s-sitename'].count()
The problem with your solution is that GroupBy.count counts the values of every column per hour, excluding missing values, so if there are no missing values you get multiple columns with the same numbers. A possible solution is to select the column after the groupby:
new_dh.groupby([new_dh.index.hour])['s-sitename'].count()
The data below was changed to show how count excludes missing values:
print (new_dh)
s-sitename sc-win32-status
date_time
2006-11-01 00:00:00 W3SVC1 0.0
2006-11-01 00:00:00 W3SVC1 0.0
2006-11-01 01:00:00 W3SVC1 0.0
2006-11-01 01:00:00 W3SVC1 0.0
2006-11-01 02:00:00 NaN 0.0
2007-02-28 02:00:00 W3SVC1 0.0
2007-02-28 10:00:00 W3SVC1 0.0
2007-02-28 23:00:00 NaN 0.0
2007-02-28 23:00:00 NaN 0.0
2007-02-28 23:00:00 W3SVC1 0.0
df = new_dh.groupby([new_dh.index.hour]).count()
print (df)
s-sitename sc-win32-status
date_time
0 2 2
1 2 2
2 1 2
10 1 1
23 1 3
So if the column is specified:
s = new_dh.groupby([new_dh.index.hour])['s-sitename'].count()
print (s)
date_time
0 2
1 2
2 1
10 1
23 1
Name: s-sitename, dtype: int64
df = new_dh.groupby([new_dh.index.hour])['s-sitename'].count().to_frame()
print (df)
s-sitename
date_time
0 2
1 2
2 1
10 1
23 1
If you need to count missing values as well, use GroupBy.size:
s = new_dh.groupby([new_dh.index.hour])['s-sitename'].size()
print (s)
date_time
0 2
1 2
2 2
10 1
23 3
Name: s-sitename, dtype: int64
df = new_dh.groupby([new_dh.index.hour])['s-sitename'].size().to_frame()
print (df)
s-sitename
date_time
0 2
1 2
2 2
10 1
23 3
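To get exactly the count per hour layout from the question, a small variation on the size solution (a sketch; rename_axis names the index and to_frame names the column):
new_dh.groupby(new_dh.index.hour)['s-sitename'].size().rename_axis('hour').to_frame('count')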
new_dh['hour'] = new_dh.index.hour
new_dh.groupby('hour')['hour'].count()
Result
hour
0 2
1 2
2 2
10 1
23 3
Name: hour, dtype: int64
If you need a DataFrame as result:
new_dh.groupby('hour')['hour'].count().rename('count').to_frame()
In this case, the result will be:
count
hour
0 2
1 2
2 2
10 1
23 3
You can also do this using the groupby() and assign() methods.
If the 'date_time' column is not your index:
result = df.assign(hour=df['date_time'].dt.hour).groupby('hour').agg(count=('s-sitename', 'count'))
If it's your index, then use:
result = df.groupby(df.index.hour)['s-sitename'].count().to_frame('count')
result.index.name = 'hour'
Now if you print result, you will get your desired output:
count
hour
0 2
1 2
2 2
10 1
23 3
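For completeness, a minimal sketch of one more route: counting the hour labels directly with value_counts (like size, this counts rows rather than non-missing values):
new_dh.index.hour.value_counts().sort_index()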
I'm trying to insert a range of date labels in my dataframe, df1. I've managed some part of the way, but I still have some bumps that I want to smooth out.
I'm trying to generate a column with dates from 2017-01-01 to 2020-12-31 with all dates repeated 24 times, i.e., a column with 35,068 rows.
dates = pd.date_range(start="01-01-2017", end="31-12-2020")
num_repeats = 24
repeated_dates = pd.DataFrame(dates.repeat(num_repeats))
df1.insert(0, 'Date', repeated_dates)
However, it only generates some iterations of the last date, meaning that my column will be NaT for the remaining x hours.
output:
Date DK1 Up DK1 Down DK2 Up DK2 Down
0 2017-01-01 0.0 0.0 0.0 0.0
1 2017-01-01 0.0 0.0 0.0 0.0
2 2017-01-01 0.0 0.0 0.0 0.0
3 2017-01-01 0.0 0.0 0.0 0.0
4 2017-01-01 0.0 0.0 0.0 0.0
... ... ... ... ... ...
35063 2020-12-31 0.0 0.0 0.0 0.0
35064 NaT 0.0 0.0 0.0 0.0
35065 NaT 0.0 -54.1 0.0 0.0
35066 NaT 25.5 0.0 0.0 0.0
35067 NaT 0.0 0.0 0.0 0.0
Furthermore, how can I change the date format from '2017-01-01' to '01-01-2017'?
You set this up perfectly, so here are the dates that you have:
import pandas as pd
import numpy as np
dates = pd.date_range(start="01-01-2017", end="31-12-2020")
num_repeats = 24
df = pd.DataFrame(dates.repeat(num_repeats),columns=['date'])
and converting the column to the format you want is simple with the strftime function
df['newFormat'] = df['date'].dt.strftime('%d-%m-%Y')
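For reference, %d-%m-%Y reads as two-digit day, two-digit month, four-digit year; any strftime directives combine the same way, e.g. a hypothetical df['date'].dt.strftime('%d/%m/%y') would give 01/01/17. Note that strftime returns strings, so the new column is no longer a datetime.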
Which gives
date newFormat
0 2017-01-01 01-01-2017
1 2017-01-01 01-01-2017
2 2017-01-01 01-01-2017
3 2017-01-01 01-01-2017
4 2017-01-01 01-01-2017
... ... ...
35059 2020-12-31 31-12-2020
35060 2020-12-31 31-12-2020
35061 2020-12-31 31-12-2020
35062 2020-12-31 31-12-2020
35063 2020-12-31 31-12-2020
Now,
dates = pd.date_range(start="01-01-2017", end="31-12-2020")
gives
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
'2017-01-05', '2017-01-06', '2017-01-07', '2017-01-08',
'2017-01-09', '2017-01-10',
...
'2020-12-22', '2020-12-23', '2020-12-24', '2020-12-25',
'2020-12-26', '2020-12-27', '2020-12-28', '2020-12-29',
'2020-12-30', '2020-12-31'],
dtype='datetime64[ns]', length=1461, freq='D')
and
1461 * 24 = 35064
so I am not sure where 35,068 comes from. Are you sure about that number?
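If the 24 repeats are meant to stand for the 24 hours of each day, a minimal alternative sketch (assuming that is the intent) builds an hourly index directly, which gives the same 35064 rows:
hours = pd.date_range(start="2017-01-01", end="2020-12-31 23:00", freq="H")
len(hours)  # 35064 = 1461 days * 24 hours
The extra four rows in df1 would still need to be accounted for either way.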
I have time series data from "5 Jan 2015" to "28 Dec 2018". I observed that some working days' dates and their values are missing. How can I check how many weekdays are missing from my time range, and what those dates are, so that I can extrapolate the values for those dates?
Example:
Date Price Volume
2018-12-28 172.0 800
2018-12-27 173.6 400
2018-12-26 170.4 500
2018-12-25 171.0 2200
2018-12-21 172.8 800
Looking at the calendar, 21 Dec 2018 was a Friday. Excluding Saturday and Sunday, the dataset should then contain "24 Dec 2018", but it's missing. I need to identify such missing dates from the range.
My approach till now:
I tried using
pd.date_range('2015-01-05','2018-12-28',freq='W')
to identify the number of weeks and then calculate the number of weekdays manually, to get a count of missing dates.
But it didn't solve the problem, as I need to identify the missing dates themselves, not just count them.
Let's say this is your full dataset:
Date Price Volume
2018-12-28 172.0 800
2018-12-27 173.6 400
2018-12-26 170.4 500
2018-12-25 171.0 2200
2018-12-21 172.8 800
And dates were:
dates = pd.date_range('2018-12-15', '2018-12-31')
First, make sure the Date column is actually a date type:
df['Date'] = pd.to_datetime(df['Date'])
Then set Date as the index:
df = df.set_index('Date')
Then reindex with unutbu's solution:
df = df.reindex(dates, fill_value=0.0)
Then reset the index to make it easier to work with:
df = df.reset_index()
It now looks like this:
index Price Volume
0 2018-12-15 0.0 0.0
1 2018-12-16 0.0 0.0
2 2018-12-17 0.0 0.0
3 2018-12-18 0.0 0.0
4 2018-12-19 0.0 0.0
5 2018-12-20 0.0 0.0
6 2018-12-21 172.8 800.0
7 2018-12-22 0.0 0.0
8 2018-12-23 0.0 0.0
9 2018-12-24 0.0 0.0
10 2018-12-25 171.0 2200.0
11 2018-12-26 170.4 500.0
12 2018-12-27 173.6 400.0
13 2018-12-28 172.0 800.0
14 2018-12-29 0.0 0.0
15 2018-12-30 0.0 0.0
16 2018-12-31 0.0 0.0
Do:
df['weekday'] = df['index'].dt.dayofweek
Finally, how many weekdays are missing in your time range:
missing_weekdays = df[(~df['weekday'].isin([5,6])) & (df['Volume'] == 0.0)]
Result:
>>> missing_weekdays
index Price Volume weekday
2 2018-12-17 0.0 0.0 0
3 2018-12-18 0.0 0.0 1
4 2018-12-19 0.0 0.0 2
5 2018-12-20 0.0 0.0 3
9 2018-12-24 0.0 0.0 0
16 2018-12-31 0.0 0.0 0
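As a shorter alternative sketch, assuming the observed dates live in the Date column as datetimes: pd.bdate_range generates business days (Monday to Friday) directly, and Index.difference returns the ones absent from the data.
bdays = pd.bdate_range('2015-01-05', '2018-12-28')       # business days only
missing = bdays.difference(pd.to_datetime(df['Date']))   # weekdays not in the data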