How to plot months of multiple years of a variable - python

Hey, I have the following dataset:
(image: columns of the dataset)
I already converted the ORDERDATE column into a datetime column.
Now I want to plot the sales per month of each year, showing the month on the x axis and sales on the y axis.
df_grouped = df_clean.groupby(by = "ORDERDATE").sum()
How can I pull out just the data for each month of a specific year?
Thanks for helping out!

You can create two additional columns from ORDERDATE, year and month:
df_clean['year'] = df_clean['ORDERDATE'].dt.year
df_clean['month'] = df_clean['ORDERDATE'].dt.month
And then you can filter and group by these columns.
For example:
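# .dt.year returns integers, so compare with the number 2022, not the string '2022'
df_2022 = df_clean.loc[df_clean['year'] == 2022, :]
# numeric_only=True skips the datetime column, which cannot be summed
df_2022.groupby('month').sum(numeric_only=True)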
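To get the plot the question asks for (months on the x axis, one line per year), the two helper columns can be pivoted into a month-by-year table. This is only a minimal sketch: the question does not show the column names, so the SALES column and the sample rows below are made up.

```python
import pandas as pd

# Hypothetical sample data; "SALES" is an assumed column name.
df_clean = pd.DataFrame({
    "ORDERDATE": pd.to_datetime(
        ["2021-01-05", "2021-01-20", "2021-02-10", "2022-01-15", "2022-02-03"]
    ),
    "SALES": [100.0, 50.0, 75.0, 200.0, 25.0],
})

df_clean["year"] = df_clean["ORDERDATE"].dt.year
df_clean["month"] = df_clean["ORDERDATE"].dt.month

# Rows are months, columns are years, values are the summed sales.
monthly_sales = df_clean.pivot_table(
    index="month", columns="year", values="SALES", aggfunc="sum"
)
print(monthly_sales)
```

Calling monthly_sales.plot() (which needs matplotlib installed) should then draw one line per year with the month on the x axis.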

Related

How to slice a pandas DataFrame between two dates (day/month) ignoring the year?

I want to filter a pandas DataFrame with a DatetimeIndex spanning multiple years to the period between the 15th of April and the 16th of September of each year. Afterwards I want to set a value using that mask.
I was hoping for a function similar to between_time(), but this doesn't exist.
My current solution is a loop over the unique years.
Minimal Example
import pandas as pd
df = pd.DataFrame({'target':0}, index=pd.date_range('2020-01-01', '2022-01-01', freq='H'))
start_date = "04-15"
end_date = "09-16"
for year in df.index.year.unique():
    # normal approach
    # df[f'{year}-{start_date}':f'{year}-{end_date}'] = 1
    # similar approach, slightly faster
    df.iloc[df.index.get_loc(f'{year}-{start_date}'):df.index.get_loc(f'{year}-{end_date}')+1]=1
Does a solution exist where I can avoid the loop and maybe improve the performance?
To get the dates between April 1st and October 31st, what about using the month?
# range(4, 11) covers months 4..10 (April through October); use = to assign, not ==
df.loc[df.index.month.isin(range(4, 11)), 'target'] = 1
If you want to match any date range while ignoring the year, you can replace the year with 2000 (a leap year) and use:
s = pd.to_datetime(df.index.strftime('2000-%m-%d'))
df.loc[(s >= '2000-04-15') & (s <= '2000-09-16'), 'target'] = 1
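Another way to avoid both the loop and the string formatting is to compare a numeric month-day key. This is only a sketch of the same idea, not something taken from the answer above; the md name and the daily sample index are made up for illustration.

```python
import pandas as pd

# A daily index keeps the example small; the same mask works on hourly data.
df = pd.DataFrame({"target": 0},
                  index=pd.date_range("2020-01-01", "2021-12-31", freq="D"))

# Encode month and day as one integer, e.g. 15 April -> 415, 16 September -> 916.
md = df.index.month * 100 + df.index.day

# Select 15 April through 16 September (inclusive) in every year.
df.loc[(md >= 415) & (md <= 916), "target"] = 1
```

Because the key is built from integers only, no per-row string conversion is needed, which may matter on long hourly series.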

calculating daily average across 30 years for different variables

I'm working with a dataframe that has daily information (measured data) across 30 years for different variables. I am trying to groupby days of the year, and then find a mean across 30 years. How do I go about this? This is what the dataframe looks like
I tried to group by day after checking the type of YYYYMMDD (it's int64). Now the dataframe looks like this; it has just gained new columns for day, month and year.
I'm a bit stuck on how to calculate means from here. I would need to somehow group all Jan 1sts, Jan 2nds, etc. over the 30 years and then average them.
You can groupby with month and day:
df.index = pd.to_datetime(df.index)
( df.groupby([df.index.month, df.index.day]).mean().reset_index().
rename({'level_0':'month', 'level_1':'day'}, axis=1))
or, if you just want the rows numbered consecutively in calendar order (similar to a day-of-year index) instead of labelled by month and day, set as_index=False:
df.groupby([df.index.month, df.index.day], as_index=False).mean()
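For reference, a small self-contained sketch of the grouping on made-up data (the temp column and the three-year range are assumptions, not the asker's data). It also shows dayofyear, which numbers days 1 to 365/366 directly, with the caveat that in leap years every date after 28 February shifts by one position.

```python
import numpy as np
import pandas as pd

# Hypothetical daily measurements over three years (2020 is a leap year).
idx = pd.date_range("2019-01-01", "2021-12-31", freq="D")
df = pd.DataFrame({"temp": np.arange(len(idx), dtype=float)}, index=idx)

# Mean per calendar date (month, day) across all years; 29 Feb gets its own row.
by_date = df.groupby([df.index.month, df.index.day]).mean()
by_date.index.names = ["month", "day"]

# Mean per day-of-year number (1..366).
by_doy = df.groupby(df.index.dayofyear).mean()
```

Grouping by (month, day) keeps each calendar date aligned across leap and non-leap years, which is usually what "all Jan 1sts, all Jan 2nds" means.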

dataframe: how to get columns of Period objects (calendar + fiscal year and month) from DatetimeIndex?

I have a dataframe, and these are the first 5 index values. There are several rows with different data points for one date, and then it moves on to the next day.
DatetimeIndex(['2014-01-01', '2014-01-01', '2014-01-01', '2014-01-01',
'2014-01-01'],
dtype='datetime64[ns]', name='DayStartedOn', freq=None)
and these are the current column dtypes:
country        object
type           object
name           object
injection     float64
withdrawal    float64
cy_month    period[M]
I wish to add a column with the calendar year-month, and 2 columns with different fiscal years and months.
It would be better to separate year and month into different columns, such as: calendar year, calendar month, fiscal year, fiscal month. The objective is to keep these column values when I regroup or resample with other columns.
I achieved above cy_month by
df['cy_month']=df.index.to_period('M')
although I don't feel very comfortable with this, as I want the period, not the month end.
I tried to add these 2 columns
for calendar year:
pd.Period(df_storage_clean.index.year, freq='A-DEC')
for another fiscal year:
pd.Period(df_storage_clean.index.year, freq='A-SEP')
but got this traceback:
ValueError: Value must be Period, string, integer, or datetime
So I started doing it without pandas, looping row by row and appending to a list,
lst_period_cy=[]
for y in lst_cy:
    period_cy=pd.Period(y, freq='A-DEC')
    lst_period_cy.append(period_cy)
then converting the list to a Series or df and adding it back to the df,
but I suppose that's not efficient (150k rows of data), so I haven't continued.
Just in case you haven't found a solution yet ...
You could do the following:
df.reset_index(drop=False, inplace=True)
df['cal_year_month'] = df.DayStartedOn.dt.month
df['cal_year'] = df.DayStartedOn.dt.year
df['fisc_year'] = df.DayStartedOn.apply(pd.Period, freq='A-SEP')
df.set_index('DayStartedOn', drop=True, inplace=True)
My assumption is that, as in your example, the index is named DayStartedOn. If that's not the case then the code has to be adjusted accordingly.
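The row-by-row apply can likely be avoided, because the index itself can be converted in one call. A hedged sketch, assuming a fiscal year that ends in September (as A-SEP implies) and using the quarterly alias Q-SEP only to read off the fiscal year. The sample rows, the fisc_month column and its arithmetic are my own additions, not part of the answer above.

```python
import pandas as pd

# Hypothetical sample index named as in the question.
idx = pd.DatetimeIndex(["2014-01-01", "2014-09-30", "2014-10-01"],
                       name="DayStartedOn")
df = pd.DataFrame({"injection": [1.0, 2.0, 3.0]}, index=idx)

df["cal_year"] = df.index.year
df["cal_month"] = df.index.month

# qyear is the fiscal year a date belongs to (named after the year it ends in).
df["fisc_year"] = df.index.to_period("Q-SEP").qyear

# Fiscal month 1 is October when the fiscal year ends in September.
df["fisc_month"] = (df.index.month - 10) % 12 + 1
```

Because these are plain integer columns, they can be used directly as groupby keys alongside the other columns.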

How do you plot month and year data to bar chart in matplotlib?

I have data like this that I want to plot by month and year using matplotlib.
df = pd.DataFrame({'date':['2018-10-01', '2018-10-05', '2018-10-20','2018-10-21','2018-12-06',
'2018-12-16', '2018-12-27', '2019-01-08','2019-01-10','2019-01-11',
'2019-01-12', '2019-01-13', '2019-01-25', '2019-02-01','2019-02-25',
'2019-04-05','2019-05-05','2018-05-07','2019-05-09','2019-05-10'],
'counts':[10,5,6,1,2,
5,7,20,30,8,
9,1,10,12,50,
8,3,10,40,4]})
First, I converted the date column to datetime and extracted the year and month from each date.
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
Then, I tried to do groupby like this.
aggmonth = df.groupby(['year', 'month']).sum()
I want to visualize it in a bar chart or something similar. But as you can see above, there are missing months in the data, and I want those missing months to be filled with 0s. I don't know how to do that in a dataframe like this. Previously, I asked a question about filling missing dates in a period of data, where I converted the dates to a period range in month-year format.
by_month = pd.to_datetime(df['date']).dt.to_period('M').value_counts().sort_index()
by_month.index = pd.PeriodIndex(by_month.index)
df_month = by_month.rename_axis('month').reset_index(name='counts')
df_month
idx = pd.period_range(df_month['month'].min(), df_month['month'].max(), freq='M')
s = df_month.set_index('month').reindex(idx, fill_value=0)
s
But when I tried to plot s using matplotlib, it returned an error. It turned out that matplotlib could not plot the period data directly.
So basically I have these two ideas, but I'm stuck on both, and I don't know which one I should keep pursuing to get the result I want.
What is the best way to do this? Thanks.
Convert the date column to pandas datetime series, then use groupby on monthly period and aggregate the data using sum, next use DataFrame.resample on the aggregated dataframe to resample using monthly frequency:
df['date'] = pd.to_datetime(df['date'])
# select 'counts' explicitly: recent pandas versions refuse to sum the datetime column
df1 = df.groupby(df['date'].dt.to_period('M'))[['counts']].sum()
df1 = df1.resample('M').asfreq().fillna(0)
Plotting the data:
df1.plot(kind='bar')
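An alternative sketch that sidesteps period indexes altogether: resampling a datetime-indexed series to month starts with sum() already yields 0 for months that have no rows. The three sample rows are taken from the question's data; the %Y-%m labels are just one possible choice.

```python
import pandas as pd

df = pd.DataFrame({"date": ["2018-10-01", "2018-12-06", "2019-01-08"],
                   "counts": [10, 2, 20]})
df["date"] = pd.to_datetime(df["date"])

# "MS" = month start; empty months (here 2018-11) sum to 0.
monthly = df.set_index("date")["counts"].resample("MS").sum()

# Plain string labels work well as bar-chart categories.
monthly.index = monthly.index.strftime("%Y-%m")
print(monthly)
```

monthly.plot(kind='bar') should then draw one bar per month, including the empty ones.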

Group Pandas Dataframe by Year and Month

I have the following dataframe dft with two columns 'DATE' and 'Income'
dft = pd.DataFrame(chunk, columns=['DATE','Income'])
dft['DATE'] = pd.to_datetime(dft['DATE'], format='%m/%d/%Y')
_= dft.sort_values(by='DATE', ascending=1)
I am now trying to sum the data up for each month of each year. This would mean the new dataframe has two columns like Jan 2012 and then the income for that month in that year. I can do this for just a month by using the following code but this doesn't take into account the year that month sits in. Is there a way I can groupby month and year?
monthlyincome = dft.groupby(dft['DATE'].dt.strftime('%B'))[['Income']].sum().reset_index()
The end goal is to then put this into a bar chart. I was thinking converting into two lists and then using something like:
plt.bar(xaxis,yaxis)
How can I get this to work?
Final Solution was:
dft = pd.DataFrame(chunk, columns=['DATE','Income'])
dft['DATE'] = pd.to_datetime(dft['DATE'], format='%m/%d/%Y')
_= dft.sort_values(by='DATE', ascending=1)
periods = dft.DATE.dt.to_period("M")
group = dft.groupby(periods)[['Income']].sum()
group = group.reset_index()
Thanks to Mayank.
Try this:
periods = dft.DATE.dt.to_period("M")
group = dft.groupby(periods)[['Income']].sum()
This should return the income summed per year and month combined (e.g. 2012-01).
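For the "Jan 2012" style labels and the bar chart mentioned in the question, the period values can be turned into strings. A sketch on made-up data (the sample rows and the label column are assumptions):

```python
import pandas as pd

# Hypothetical sample rows in the question's %m/%d/%Y format.
dft = pd.DataFrame({"DATE": ["01/15/2012", "01/20/2012", "02/03/2012", "01/10/2013"],
                    "Income": [100, 50, 70, 30]})
dft["DATE"] = pd.to_datetime(dft["DATE"], format="%m/%d/%Y")

# One group per year-month period, summing only the Income column.
periods = dft["DATE"].dt.to_period("M")
group = dft.groupby(periods)["Income"].sum().reset_index()

# Human-readable labels such as "Jan 2012" for the x axis.
group["label"] = group["DATE"].dt.strftime("%b %Y")
```

plt.bar(group['label'], group['Income']) would then give one bar per month of each year, in chronological order.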
