How to group by day and month in pandas? - python

Given a series like this
Date
2005-01-01 128
2005-01-02 72
2005-01-03 67
2005-01-04 61
2005-01-05 33
Name: Data_Value, dtype: int64
for several years, how do I group all the January 1sts together, all the January 2nds, etc?
I'm actually trying to find the max for each day of the year across several years, so it does not have to be groupby. If there is an easier way to do this, that would be great.

You can convert your index to datetime, then use strftime to get a formatted date string to group on:
df.groupby(pd.to_datetime(df.index).strftime('%b-%d'))['Data_Value'].max()
If there are no NaNs in your date strings, you can slice them as well. This returns strings of the format "MM-DD":
df.groupby(df.index.astype(str).str[5:])['Data_Value'].max()
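For illustration, here is a minimal sketch of the month-day grouping on made-up data spanning two years. The values and the variable names (s, daily_max) are assumptions, not taken from the question. It groups on a zero-padded '%m-%d' key rather than '%b-%d' because the numeric key sorts in calendar order, while month abbreviations sort alphabetically and depend on the locale.

```python
import pandas as pd

# Hypothetical sample data: the same three calendar days in two different years.
dates = pd.to_datetime([
    "2005-01-01", "2005-01-02", "2005-01-03",
    "2006-01-01", "2006-01-02", "2006-01-03",
])
s = pd.Series([128, 72, 67, 90, 150, 61], index=dates, name="Data_Value")

# Group on a zero-padded "MM-DD" key so every Jan 1st lands in the same group.
daily_max = s.groupby(s.index.strftime("%m-%d")).max()
print(daily_max)
```

Each key of daily_max is a month-day string, so daily_max["01-02"] looks up the maximum over all January 2nds.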

Why not just keep it simple?
max_temp = dfall.groupby([(dfall.Date.dt.month),(dfall.Date.dt.day)])['Data_Value'].max()

As an alternative, you can use a pivot table.
Reset the index and build the date columns (the index in the question is named Date, so reset_index() turns it into a Date column):
df=df.reset_index()
df['date']=pd.to_datetime(df['Date'])
df['year']=df['date'].dt.year
df['month']=df['date'].dt.month
df['day']=df['date'].dt.day
Pivot over the month and day columns, aggregating the value column:
df_grouped=df.pivot_table(index=['month','day'],values='Data_Value',aggfunc='max')
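A small self-contained sketch of the pivot-table route, assuming a frame with a Date column and a Data_Value column. The sample values and the name by_day are made up for illustration.

```python
import pandas as pd

# Hypothetical data: two calendar days observed in two different years.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2005-01-01", "2005-01-02", "2006-01-01", "2006-01-02"]),
    "Data_Value": [128, 72, 90, 150],
})
df["month"] = df["Date"].dt.month
df["day"] = df["Date"].dt.day

# A list of column names gives a (month, day) MultiIndex on the result.
by_day = df.pivot_table(index=["month", "day"], values="Data_Value", aggfunc="max")
print(by_day)
```

The result is indexed by (month, day) pairs, so by_day.loc[(1, 2), "Data_Value"] gives the maximum for January 2nd.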

Related

python dataframe datetime condition

I am trying to create a new dataframe from an existing one by filtering on holiday dates. The train dataframe already exists, and I want to create train_holiday from it by matching the day and month values of the holiday dataframe. What I have in mind looks like this:
date values
2015-02-01 10
2015-02-02 20
2015-02-03 30
2015-02-04 40
2015-02-05 50
2015-02-06 60
date
2012-02-02
2012-02-05
The first dataframe is the existing one, and the second one lists the holidays. I want to create a new dataframe from the first that contains only the 2015 holidays, like this:
date values
2015-02-02 20
2015-02-05 50
I tried
train_holiday = train.loc[train["date"].dt.day== holidays["date"].dt.day]
but it gives an error. Could you please help me with this?
In your problem you only care about the month and day components, and one way to extract them is dt.strftime(). Apply that extraction to both date columns and use .isin() to keep the rows of df1 whose month-day matches one in df2.
df1[
    df1['date'].dt.strftime('%m%d').isin(
        df2['date'].dt.strftime('%m%d')
    )
]
Make sure both date columns are in date-time format so that .dt can work. For example,
df1['date'] = pd.to_datetime(df1['date'])
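To make this concrete, here is a sketch that rebuilds the two frames from the question and applies the month-day filter. The names train, holidays and train_holiday follow the question; holiday_keys is a helper name introduced only for this example.

```python
import pandas as pd

# Sample frames rebuilt from the question.
train = pd.DataFrame({
    "date": pd.to_datetime([
        "2015-02-01", "2015-02-02", "2015-02-03",
        "2015-02-04", "2015-02-05", "2015-02-06",
    ]),
    "values": [10, 20, 30, 40, 50, 60],
})
holidays = pd.DataFrame({"date": pd.to_datetime(["2012-02-02", "2012-02-05"])})

# Compare on a "MMDD" string so that the year is ignored.
holiday_keys = holidays["date"].dt.strftime("%m%d")
train_holiday = train[train["date"].dt.strftime("%m%d").isin(holiday_keys)]
print(train_holiday)
```

The original attempt compared two Series of different lengths with ==, which pandas rejects. isin() instead tests each row against the whole set of holiday keys.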

How To Sum all the values of a column for a date instance in pandas

I am working on time-series data, where I have two columns date and quantity. The date is day wise. I want to add all the quantity for a month and convert it into a single date.
date is my index column
Example
quantity
date
2018-01-03 30
2018-01-05 45
2018-01-19 30
2018-02-09 10
2018-02-19 20
Output :
quantity
date
2018-01-01 105
2018-02-01 30
Thanks in advance!!
You can downsample to combine the data for each month and sum it by chaining the sum method.
df.resample("M").sum()
Note that "M" labels each month by its last day; use "MS" to label it by its first day, as in the desired output. See the resampling section of the pandas user guide for details.
You'll need to make sure your index is in datetime format for this to work. So first do: df.index = pd.to_datetime(df.index). Hat tip to sammywemmy for the same advice in the comments.
You can also use groupby to get the result.
df.index = pd.to_datetime(df.index)
df.groupby(df.index.strftime('%Y-%m-01')).sum()
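As a minimal sketch on the sample data from the question, the "MS" (month start) alias labels each bin with the first day of the month, which matches the requested output. The variable name monthly is an assumption made for this example.

```python
import pandas as pd

# Sample data from the question, with the dates as a DatetimeIndex.
df = pd.DataFrame(
    {"quantity": [30, 45, 30, 10, 20]},
    index=pd.to_datetime([
        "2018-01-03", "2018-01-05", "2018-01-19", "2018-02-09", "2018-02-19",
    ]),
)
df.index.name = "date"

# "MS" = month start: one row per month, labelled with the first of the month.
monthly = df.resample("MS").sum()
print(monthly)
```

Unlike the strftime-based groupby, this keeps a real DatetimeIndex on the result, which is convenient for plotting or further resampling.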

How to count nonzero occurrences based on another variable in python?

Date Precipitation
20010101 0
20010102 10
20010103 5
20010104 3
20010105 0
...
20011231 0
I have a dataset showing precipitation (in) for each day of the year 2001. The date variable is in YYYYMMDD format. I want to calculate how many times it precipitated each month. In other words, I need the number of days in each month on which the precipitation value is not 0.
I am a beginner python learner and don’t quite know how to tell the program to output the count per each month without having to do it individually.
The code I have below does not work because I’m not sure how to tell the program the Date variable is in YYYYMMDD format.
Precip_Count= Date[(Precipitation !=0)]
Is there a way to do this by only using NumPy?
First, convert the Date column to datetime using pd.to_datetime, specifying the format of your date strings with datetime format codes. Then use Series.ne to flag the non-zero values, group by month, and take the sum with GroupBy.sum:
df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")
df['Precipitation'].ne(0).groupby(df.Date.dt.month).sum()
Date
1 3
...
12 0
Name: Precipitation, dtype: int64
Or use Series.dt.to_period:
df['Precipitation'].ne(0).groupby(df.Date.dt.to_period('M')).sum()
Date
2001-01 3
...
2001-12 0
Freq: M, Name: Precipitation, dtype: int64
If you want the index as a DatetimeIndex, use pd.Grouper (the dates must be the index for this to work):
df.set_index('Date')['Precipitation'].ne(0).groupby(pd.Grouper(freq='M')).sum()
Date
2001-01-31 3
...
2001-12-31 0
Freq: M, Name: Precipitation, dtype: int64
The output is calculated from df mentioned in the question.
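The question also asks whether this can be done with NumPy alone. One possible sketch, which is not part of the answer above, treats the YYYYMMDD values as integers and counts wet days with np.bincount. The array contents and the names month and wet_days_per_month are made up for illustration.

```python
import numpy as np

# Hypothetical excerpt of the data: integer dates in YYYYMMDD form.
date = np.array([20010101, 20010102, 20010103, 20010104, 20010105,
                 20010201, 20010202])
precipitation = np.array([0, 10, 5, 3, 0, 0, 7])

# For an integer YYYYMMDD, dropping the last two digits and then taking the
# last two digits of what remains isolates the month.
month = (date // 100) % 100

# Count the days with non-zero precipitation per month. bincount starts at
# index 0, which no month uses, so that first entry is dropped.
wet_days_per_month = np.bincount(month[precipitation != 0], minlength=13)[1:]
print(wet_days_per_month)
```

The result has twelve entries, one per month from January to December.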

How to filter dataframe by end of business month ('BM') using datetime?

I'm trying to look at the adjusted close stock values of a particular stock at the end of the month. I was able to get a dataframe of dates and adjclose values, but I can't seem to be able to filter that dataframe to include only dates that are end of month and their corresponding adj close value.
apple_adjclose = apple_stock[['date','adjclose']]
This is the dataframe, which includes dates for 2 years in the format YYYY-MM-DD; adjclose holds float values. Help is really appreciated!
Let's say you have a dataframe like this with two columns:
import numpy as np
import pandas as pd
dates = pd.date_range('01/01/2016', '12/31/2017')
df = pd.DataFrame({'date':dates,'adjclose':np.random.randint(100,200, len(dates))})
You can create an instance of the BMonthEnd offset to get the business month-end dates and slice the dataframe with them:
df.loc[df.date.isin(df.date + pd.offsets.BMonthEnd(1))]
adjclose date
28 128 2016-01-29
59 193 2016-02-29
90 167 2016-03-31
119 185 2016-04-29
151 133 2016-05-31
181 184 2016-06-30
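Real price data usually contains trading days only, so a business month end that falls on a market holiday is simply absent from the frame. One alternative sketch, which is an assumption about the use case and not part of the answer above, takes the last available row of each calendar month. The prices are made up and 2016-03-31 is left out on purpose.

```python
import pandas as pd

# Hypothetical prices on business days only; the range stops one day before
# the business month end of March to mimic a missing trading day.
dates = pd.bdate_range("2016-01-01", "2016-03-30")
apple_adjclose = pd.DataFrame({
    "date": dates,
    "adjclose": [100.0 + i for i in range(len(dates))],
})

# Sort by date, group by calendar month, and keep the last row of each group.
ordered = apple_adjclose.sort_values("date")
month_end = ordered.groupby(ordered["date"].dt.to_period("M")).tail(1)
print(month_end)
```

For March this returns 2016-03-30, the last date actually present, where an exact match on business month ends would return nothing.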
I converted date from object to datetime and then used .asfreq() to get what I needed.

Extracting just Month and Year separately from Pandas Datetime column

I have a Dataframe, df, with the following column:
df['ArrivalDate'] =
...
936 2012-12-31
938 2012-12-29
965 2012-12-31
966 2012-12-31
967 2012-12-31
968 2012-12-31
969 2012-12-31
970 2012-12-29
971 2012-12-31
972 2012-12-29
973 2012-12-29
...
The elements of the column are pandas.tslib.Timestamp.
I want to just include the year and month. I thought there would be simple way to do it, but I can't figure it out.
Here's what I've tried:
df['ArrivalDate'].resample('M', how = 'mean')
I got the following error:
Only valid with DatetimeIndex or PeriodIndex
Then I tried:
df['ArrivalDate'].apply(lambda(x):x[:-2])
I got the following error:
'Timestamp' object has no attribute '__getitem__'
Any suggestions?
Edit: I sort of figured it out.
df.index = df['ArrivalDate']
Then, I can resample another column using the index.
But I'd still like a method for reconfiguring the entire column. Any ideas?
If you want new columns showing year and month separately you can do this:
df['year'] = pd.DatetimeIndex(df['ArrivalDate']).year
df['month'] = pd.DatetimeIndex(df['ArrivalDate']).month
or...
df['year'] = df['ArrivalDate'].dt.year
df['month'] = df['ArrivalDate'].dt.month
Then you can combine them or work with them just as they are.
The df['date_column'] has to be in date time format.
df['month_year'] = df['date_column'].dt.to_period('M')
You can also use D for day, 2M for two months, and so on for other sampling intervals. For time series data with timestamps, finer intervals such as 45Min or 15Min are available as well.
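A short sketch of the to_period approach, assuming a hypothetical frame with a date_column of dates. The month_start column is an extra added here to show that a period can be turned back into a timestamp.

```python
import pandas as pd

# Hypothetical input column.
df = pd.DataFrame({
    "date_column": pd.to_datetime(["2012-12-31", "2012-12-29", "2013-01-02"]),
})

# Monthly periods keep calendar semantics (sorting, grouping, arithmetic).
df["month_year"] = df["date_column"].dt.to_period("M")

# A period column can be converted back to the timestamp at its start.
df["month_start"] = df["month_year"].dt.to_timestamp()
print(df)
```

Because month_year is a period column and not a string, grouping or sorting on it follows the calendar.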
You can directly access the year and month attributes, or request a datetime.datetime:
In [15]: t = pandas.Timestamp.now()
In [16]: t
Out[16]: Timestamp('2014-08-05 14:49:39.643701', tz=None)
In [17]: t.to_pydatetime() #datetime method is deprecated
Out[17]: datetime.datetime(2014, 8, 5, 14, 49, 39, 643701)
In [18]: t.day
Out[18]: 5
In [19]: t.month
Out[19]: 8
In [20]: t.year
Out[20]: 2014
One way to combine year and month is to make an integer encoding them, such as: 201408 for August, 2014. Along a whole column, you could do this as:
df['YearMonth'] = df['ArrivalDate'].map(lambda x: 100*x.year + x.month)
or many variants thereof.
I'm not a big fan of doing this, though, since it makes date alignment and arithmetic painful later and especially painful for others who come upon your code or data without this same convention. A better way is to choose a day-of-month convention, such as final non-US-holiday weekday, or first day, etc., and leave the data in a date/time format with the chosen date convention.
The calendar module is useful for obtaining the number value of certain days such as the final weekday. Then you could do something like:
import calendar
import datetime
df['AdjustedDateToEndOfMonth'] = df['ArrivalDate'].map(
    lambda x: datetime.datetime(
        x.year,
        x.month,
        max(calendar.monthcalendar(x.year, x.month)[-1][:5])
    )
)
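To see what the calendar expression does, here is the same logic as a small standalone function. The name last_weekday_of_month is made up for this sketch, and holidays are not taken into account.

```python
import calendar
import datetime


def last_weekday_of_month(year, month):
    """Return the last Monday-to-Friday date of the given month."""
    # monthcalendar returns one list per week, Monday first, with 0 standing
    # in for days that belong to a neighbouring month.
    last_week = calendar.monthcalendar(year, month)[-1]
    # The first five entries are Monday to Friday; the largest is the last one.
    return datetime.date(year, month, max(last_week[:5]))


print(last_weekday_of_month(2016, 1))
print(last_weekday_of_month(2012, 12))
```

January 2016 ends on a Sunday, so the function falls back to Friday the 29th, while December 2012 ends on a Monday and is returned as is.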
If you happen to be looking for a way to solve the simpler problem of just formatting the datetime column into some stringified representation, for that you can just make use of the strftime function from the datetime.datetime class, like this:
In [5]: df
Out[5]:
date_time
0 2014-10-17 22:00:03
In [6]: df.date_time
Out[6]:
0 2014-10-17 22:00:03
Name: date_time, dtype: datetime64[ns]
In [7]: df.date_time.map(lambda x: x.strftime('%Y-%m-%d'))
Out[7]:
0 2014-10-17
Name: date_time, dtype: object
If you want the month year unique pair, using apply is pretty sleek.
df['mnth_yr'] = df['date_column'].apply(lambda x: x.strftime('%B-%Y'))
Outputs month-year in one column.
Don't forget to convert the column to datetime first; I generally forget.
df['date_column'] = pd.to_datetime(df['date_column'])
Single line: adding a column with 'year-month' pairs:
('pd.to_datetime' first changes the column dtype to date-time before the operation)
df['yyyy-mm'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%Y-%m')
Accordingly for an extra 'year' or 'month' column:
df['yyyy'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%Y')
df['mm'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%m')
Extracting the year from, say, ['2018-03-04']:
df['Year'] = pd.DatetimeIndex(df['date']).year
Assigning to df['Year'] creates a new column. If you want to extract the month instead, just use .month.
You can first convert your date strings with pandas.to_datetime, which gives you access to all of the numpy datetime and timedelta facilities. For example:
df['ArrivalDate'] = pandas.to_datetime(df['ArrivalDate'])
df['Month'] = df['ArrivalDate'].values.astype('datetime64[M]')
@KieranPC's solution is the correct approach for pandas, but it is not easily extensible to arbitrary attributes. For this, you can use getattr within a generator expression and combine the results using pd.concat:
# input data
list_of_dates = ['2012-12-31', '2012-12-29', '2012-12-30']
df = pd.DataFrame({'ArrivalDate': pd.to_datetime(list_of_dates)})
# define list of attributes required
# ('weekofyear' was removed from .dt in pandas 2.0; use .dt.isocalendar().week there)
L = ['year', 'month', 'day', 'dayofweek', 'dayofyear', 'weekofyear', 'quarter']
# define generator expression of series, one for each attribute
date_gen = (getattr(df['ArrivalDate'].dt, i).rename(i) for i in L)
# concatenate results and join to original dataframe
df = df.join(pd.concat(date_gen, axis=1))
print(df)
  ArrivalDate  year  month  day  dayofweek  dayofyear  weekofyear  quarter
0  2012-12-31  2012     12   31          0        366           1        4
1  2012-12-29  2012     12   29          5        364          52        4
2  2012-12-30  2012     12   30          6        365          52        4
Thanks to jaknap32, I wanted to aggregate the results according to Year and Month, so this worked:
df_join['YearMonth'] = df_join['timestamp'].apply(lambda x:x.strftime('%Y%m'))
Output was neat:
0 201108
1 201108
2 201108
There are two steps to extract the year for the whole column without using apply.
Step 1
Convert the column to datetime:
df['ArrivalDate']=pd.to_datetime(df['ArrivalDate'], format='%Y-%m-%d')
Step 2
Extract the year or the month using pd.DatetimeIndex:
pd.DatetimeIndex(df['ArrivalDate']).year
df['Month_Year'] = df['Date'].dt.to_period('M')
Result :
Date Month_Year
0 2020-01-01 2020-01
1 2020-01-02 2020-01
2 2020-01-03 2020-01
3 2020-01-04 2020-01
4 2020-01-05 2020-01
df['year_month']=df.datetime_column.apply(lambda x: str(x)[:7])
This worked fine for me. I didn't think pandas would interpret the resulting string as a date, but when I did the plot, it knew very well what I wanted and the year_month strings were ordered properly... gotta love pandas!
Then I tried:
df['ArrivalDate'].apply(lambda(x):x[:-2])
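I think the proper input here should be a string. Also note that lambda(x) is Python 2 only syntax, and that dropping the last three characters of 'YYYY-MM-DD' leaves 'YYYY-MM':
df['ArrivalDate'].astype(str).apply(lambda x: x[:-3])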
