I have the following dataframe dft with two columns 'DATE' and 'Income'
dft = pd.DataFrame(chunk, columns=['DATE','Income'])
dft['DATE'] = pd.to_datetime(dft['DATE'], format='%m/%d/%Y')
dft = dft.sort_values(by='DATE', ascending=True)
I am now trying to sum the data up for each month of each year. This would mean the new dataframe has two columns like Jan 2012 and then the income for that month in that year. I can do this for just a month by using the following code but this doesn't take into account the year that month sits in. Is there a way I can groupby month and year?
monthlyincome = dft.groupby(dft['DATE'].dt.strftime('%B'))[['Income']].sum().reset_index()
The end goal is to then put this into a bar chart. I was thinking of converting the result into two lists and then using something like:
plt.bar(xaxis,yaxis)
How can I get this to work?
Final Solution was:
dft = pd.DataFrame(chunk, columns=['DATE','Income'])
dft['DATE'] = pd.to_datetime(dft['DATE'], format='%m/%d/%Y')
dft = dft.sort_values(by='DATE', ascending=True)
periods = dft.DATE.dt.to_period("M")
# select 'Income' so the datetime column is not part of the sum
group = dft.groupby(periods)['Income'].sum()
group = group.reset_index()
Thanks to Mayank.
Try this:
periods = dft.DATE.dt.to_period("M")
group = dft.groupby(periods)['Income'].sum()
This should give you the income summed per year and month combined (for example 2012-01).
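For the bar chart part of the question, here is a minimal sketch on made-up data (the sample rows are hypothetical and stand in for the asker's `chunk`). It groups on the monthly period and then turns the period index into "Jan 2012"-style labels, giving the two lists that `plt.bar` expects.

```python
import pandas as pd

# Hypothetical sample rows standing in for the asker's `chunk`
dft = pd.DataFrame(
    {"DATE": ["01/05/2012", "01/20/2012", "02/03/2012", "01/10/2013"],
     "Income": [100.0, 50.0, 70.0, 30.0]}
)
dft["DATE"] = pd.to_datetime(dft["DATE"], format="%m/%d/%Y")

# Group on year+month; selecting 'Income' keeps the datetime column out of the sum
monthly = dft.groupby(dft["DATE"].dt.to_period("M"))["Income"].sum()

# Labels like 'Jan 2012' and the matching totals, ready for a bar chart
xaxis = monthly.index.strftime("%b %Y").tolist()
yaxis = monthly.tolist()

# With matplotlib installed: plt.bar(xaxis, yaxis)
```

Because the groups are periods rather than month names, January 2012 and January 2013 stay separate and the bars come out in chronological order.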
Related
I want to filter a pandas DataFrame with a DatetimeIndex, for multiple years, to the rows between the 15th of April and the 16th of September. Afterwards I want to set a value using the mask.
I was hoping for a function similar to between_time(), but this doesn't exist.
My current solution is a loop over the unique years.
Minimal Example
import pandas as pd
df = pd.DataFrame({'target':0}, index=pd.date_range('2020-01-01', '2022-01-01', freq='H'))
start_date = "04-15"
end_date = "09-16"
for year in df.index.year.unique():
    # normal approach
    # df[f'{year}-{start_date}':f'{year}-{end_date}'] = 1
    # similar approach, slightly faster
    df.iloc[df.index.get_loc(f'{year}-{start_date}'):df.index.get_loc(f'{year}-{end_date}')+1]=1
Does a solution exist where I can avoid the loop and maybe improve the performance?
To get the dates between April 1st and September 30th, what about using the month?
df.loc[df.index.month.isin(range(4, 10)), 'target'] = 1
If you want to match any date/time while ignoring the year, you can replace the year with 2000 (a leap year, so 29 February stays valid) and use:
s = pd.to_datetime(df.index.strftime('2000-%m-%d'))
df.loc[(s >= '2000-04-15') & (s <= '2000-09-16'), 'target'] = 1
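An alternative that avoids the string round trip is to encode month and day as one integer and compare that. This is a sketch of a different technique from the one above, and it uses a daily index to keep the example small.

```python
import pandas as pd

df = pd.DataFrame({"target": 0},
                  index=pd.date_range("2020-01-01", "2022-01-01", freq="D"))

# Encode month and day as one integer, e.g. 15 April -> 415, 16 September -> 916
month_day = df.index.month * 100 + df.index.day

# Boolean mask for 15 April .. 16 September (inclusive) in every year
mask = (month_day >= 415) & (month_day <= 916)
df.loc[mask, "target"] = 1
```

The comparison works because month * 100 + day sorts the same way as the calendar within a year. A window that wraps around New Year (say November to February) would need `|` instead of `&`.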
hey i have the following dataset
Columns of the dataset
I already converted the ORDERDATE column into a datetime column.
Now I want to plot the sales per month of each year, showing the month on the x axis and the sales on the y axis.
df_grouped = df_clean.groupby(by = "ORDERDATE").sum()
How can I pull out the data for each month of a specific year?
thanks for helping out!
You can create two additional columns from ORDERDATE, year and month:
df_clean['year'] = df_clean['ORDERDATE'].dt.year
df_clean['month'] = df_clean['ORDERDATE'].dt.month
And then you can filter and group by these columns.
For example:
# .dt.year yields integers, so compare against 2022 rather than the string '2022'
df_2022 = df_clean.loc[df_clean['year'] == 2022, :]
df_2022.groupby('month').sum(numeric_only=True)
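To get every year at once rather than filtering one year at a time, the grouping can be done on year and month together. The sketch below uses invented rows, and `SALES` is an assumed column name since the real columns are not shown in the question.

```python
import pandas as pd

# Hypothetical stand-in for df_clean; 'SALES' is an assumed column name
df_clean = pd.DataFrame({
    "ORDERDATE": pd.to_datetime(["2021-01-15", "2021-01-20", "2021-02-10",
                                 "2022-01-05", "2022-02-07", "2022-02-21"]),
    "SALES": [100.0, 50.0, 80.0, 120.0, 60.0, 40.0],
})

# Total sales per (year, month)
monthly = (df_clean
           .groupby([df_clean["ORDERDATE"].dt.year.rename("year"),
                     df_clean["ORDERDATE"].dt.month.rename("month")])["SALES"]
           .sum())

# Months as rows, one column per year: convenient for a grouped bar chart
by_year = monthly.unstack("year")

# Sales per month for a single year
sales_2022 = monthly.loc[2022]

# With matplotlib installed: by_year.plot(kind="bar")
```

`by_year` puts the month on the x axis with one bar per year, while `sales_2022` is the single-year slice the question asks about.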
I am new to scripting and need some help writing this code correctly. I have a csv file with a date column. Based on the date, I need to create a new column named period, which is a combination of year and month.
If the day of the month is between 1 and 25, the month is the date's own month.
If the day of the month is greater than 25, the month is the next month.
Sample file:
Date
10/21/2021
10/26/2021
01/26/2021
Expected results:
Date        Period (year+month)
10/21/2021  202110
10/26/2021  202111
01/26/2021  202102
Two ways I can think of.
Convert the incoming string into a date object and get the values you need from there. See Converting string into datetime
Use split("/") to split the date string into a list of three values and use those to do your calculations.
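Both ideas can be sketched in plain Python as below. The function names are made up for illustration, and the December rollover into the next year is an assumption, since the question does not show that case.

```python
from datetime import datetime

def period_from_string(date_string):
    """Return 'YYYYMM'; days after the 25th roll over to the next month."""
    date = datetime.strptime(date_string, "%m/%d/%Y")
    year, month = date.year, date.month
    if date.day > 25:
        # December rolls over into January of the following year (assumed)
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return f"{year:04d}{month:02d}"

def period_from_split(date_string):
    """Same rule, using split('/') instead of datetime parsing."""
    month, day, year = (int(part) for part in date_string.split("/"))
    if day > 25:
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return f"{year:04d}{month:02d}"
```

The `strptime` version also validates the input (it raises `ValueError` for something like 13/40/2021), which the `split` version does not.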
Good question.
I've included the code that I wrote to do this, below. The process we will follow is:
Load the data from a csv
Define a function that will calculate the period for each date
Apply the function to our data and store the result as a new column
import pandas as pd
# Step 1
# read in the data from a csv, parsing dates and store the data in a DataFrame
data = pd.read_csv("filepath.csv", parse_dates=["Date"])
# Create day, month and year columns in our DataFrame
data['day'] = data['Date'].dt.day
data['month'] = data['Date'].dt.month
data['year'] = data['Date'].dt.year
# Step 2
# Define a function that will get our periods from a given date
def get_period(date):
    day = date.day
    month = date.month
    year = date.year
    if day > 25:
        if month == 12:  # if December, increment year and change month to Jan.
            year += 1
            month = 1
        else:
            month += 1
    # convert our year and month into zero-padded strings that we can concatenate easily
    year_string = str(year).zfill(4)
    month_string = str(month).zfill(2)
    period = year_string + month_string  # concat the strings together
    return period
# Step 3
# Apply our custom function (get_period) to each value in the Date column
data['period'] = data['Date'].apply(get_period)
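For larger files the same rule can be applied without a row-by-row function. This is a sketch of a vectorised alternative, not the approach used above. It moves dates after the 25th to the first day of the following month and then formats the result. The December row is added here to show the year rollover.

```python
import pandas as pd

# Sample dates from the question, plus a December case to show the year rollover
data = pd.DataFrame({"Date": pd.to_datetime(
    ["10/21/2021", "10/26/2021", "01/26/2021", "12/30/2021"], format="%m/%d/%Y")})

# Dates after the 25th are moved to the first day of the next month;
# all other dates are left unchanged
next_month_start = data["Date"] + pd.offsets.MonthBegin(1)
shifted = data["Date"].where(data["Date"].dt.day <= 25, next_month_start)

# Format the (possibly shifted) date as YYYYMM
data["period"] = shifted.dt.strftime("%Y%m")
```

The original `Date` column is left untouched, since only the helper series `shifted` is changed.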
I have data like this that I want to plot by month and year using matplotlib.
df = pd.DataFrame({'date':['2018-10-01', '2018-10-05', '2018-10-20','2018-10-21','2018-12-06',
'2018-12-16', '2018-12-27', '2019-01-08','2019-01-10','2019-01-11',
'2019-01-12', '2019-01-13', '2019-01-25', '2019-02-01','2019-02-25',
'2019-04-05','2019-05-05','2018-05-07','2019-05-09','2019-05-10'],
'counts':[10,5,6,1,2,
5,7,20,30,8,
9,1,10,12,50,
8,3,10,40,4]})
First, I converted the date column to datetime and extracted the year and month from each date.
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
Then, I tried to do groupby like this.
aggmonth = df.groupby(['year', 'month'])['counts'].sum()
I want to visualize this in a bar chart or something similar. But as you can see above, some months are missing from the data, and I want those missing months to be filled with 0s. I don't know how to do that in a dataframe like this. Previously, I asked a question about filling missing dates in a period of data, where I converted the dates to a period range in month-year format.
by_month = pd.to_datetime(df['date']).dt.to_period('M').value_counts().sort_index()
by_month.index = pd.PeriodIndex(by_month.index)
df_month = by_month.rename_axis('month').reset_index(name='counts')
df_month
idx = pd.period_range(df_month['month'].min(), df_month['month'].max(), freq='M')
s = df_month.set_index('month').reindex(idx, fill_value=0)
s
But when I tried to plot s using matplotlib, it raised an error. It turned out that matplotlib cannot plot period data directly.
So basically I got these two ideas in my head, but both are stuck, and I don't know which one I should keep pursuing to get the result I want.
What is the best way to do this? Thanks.
Convert the date column to pandas datetime series, then use groupby on monthly period and aggregate the data using sum, next use DataFrame.resample on the aggregated dataframe to resample using monthly frequency:
df['date'] = pd.to_datetime(df['date'])
df1 = df.groupby(df['date'].dt.to_period('M'))[['counts']].sum()
df1 = df1.resample('M').asfreq().fillna(0)
Plotting the data:
df1.plot(kind='bar')
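If you would rather call matplotlib directly, one option is to fill the gaps with `reindex` over a `period_range` (the idea from the question) and convert the periods to plain string labels before plotting. This is a sketch on a shortened version of the sample data.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2018-10-01", "2018-10-05", "2018-12-06", "2019-02-01"],
    "counts": [10, 5, 2, 12],
})
df["date"] = pd.to_datetime(df["date"])

# Monthly totals, indexed by period (e.g. 2018-10)
monthly = df.groupby(df["date"].dt.to_period("M"))["counts"].sum()

# Every month between the first and last observation, missing ones filled with 0
full_range = pd.period_range(monthly.index.min(), monthly.index.max(), freq="M")
monthly = monthly.reindex(full_range, fill_value=0)

# String labels sidestep the plotting error reported for Period objects
labels = monthly.index.strftime("%Y-%m").tolist()
values = monthly.tolist()

# With matplotlib installed: plt.bar(labels, values)
```

Using `fill_value=0` in `reindex` keeps the counts as integers, whereas `asfreq().fillna(0)` goes through NaN and produces floats.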
I have a Pandas dataframe of the size (80219 * 5) with the same structure as the image I have uploaded. The data can range from 2002-2016 for each company but if missing values appear the data either starts at a later date or ends at an earlier date as you can see in the image.
What I would like to do is to calculate yearly compounded returns measured from June to June for each company. If there is no data for the specific company for the full 12 months period from June to June the result should be nan. Below is my current code, but I don't know how to calculate the returns from June to June.
After having loaded the file and cleaned it I:
df[['Returns']] = df[['Returns']].apply(pd.to_numeric)
df['Names Date'] = pd.to_datetime(df['Names Date'])
df['Returns'] = df['Returns']+ 1
df = df[['Company Name','Returns','Names Date']]
df['year']=df['Names Date'].dt.year
df['cum_return'] = df.groupby(['Company Name','year'])['Returns'].cumprod()
df = df.groupby(['Company Name','year']).nth(11)
print(tabulate(df, headers='firstrow', tablefmt='psql'))
This calculates the annual return from the 1st of January to the 31st of December.
I finally found a way to do it. The easiest way I could find is to calculate a rolling 12-month compounded return for each month and then slice the dataframe to keep only the 12-month returns for the months I want:
import numpy as np
# pd.rolling_apply and Series.append no longer exist in current pandas,
# so the rolling product is computed per company with groupby + rolling
def myfunc(arr):
    return np.prod(arr)
df['Cum returns'] = (df.groupby('Company Name')['Returns']
                       .transform(lambda s: s.rolling(12).apply(myfunc, raw=True)))
# keep only the June rows and label them by year
df = df.loc[df['Names Date'].dt.month == 6].copy()
df['Names Date'] = df['Names Date'].dt.year
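A rolling window of 12 rows does not check that those rows are 12 consecutive months. A possible alternative is to label each row with the June that closes its July-to-June window and require exactly 12 observations per window. The sketch uses invented data, and the `june_year` column is hypothetical. It assumes `Returns` holds simple monthly returns (before adding 1) and at most one row per company and month.

```python
import pandas as pd

# Hypothetical monthly data: company A has two full July-June years,
# company B starts in September, so its first window is incomplete
dates_a = pd.date_range("2002-07-01", periods=24, freq="MS")
dates_b = pd.date_range("2002-09-01", periods=10, freq="MS")
df = pd.DataFrame({
    "Company Name": ["A"] * 24 + ["B"] * 10,
    "Names Date": dates_a.append(dates_b),
    "Returns": 0.01,
})

# Label each row with the year of the June that closes its 12-month window:
# July 2002 .. June 2003 -> 2003
month = df["Names Date"].dt.month
df["june_year"] = df["Names Date"].dt.year + (month > 6).astype(int)

# Compound the gross returns per company and window, and count the months
gross = 1 + df["Returns"]
grouped = gross.groupby([df["Company Name"], df["june_year"]])

# Keep the compounded return only where all 12 months are present, else NaN
annual = (grouped.prod() - 1).where(grouped.count() == 12)
```

Here `annual` is indexed by company and closing-June year. Company B's 2003 entry is NaN because only 10 of its 12 months exist.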