Modify Code from Year Timespan to Month/Week Timespans - python

I am making a stacked bar plot over a year time span where the x-axis is company names, y-axis is the number of calls, and the stacks are the months.
I want to be able to make this plot run for a time span of a month, where the stacks are days, and a time span of a week, where the stacks are days. I am having trouble doing this since my code is already built around the year time span.
My input is a dataframe that looks like this
pivot_table.head(3)
Out[12]:
Month 1 2 3 4 5 6 7 8 9 10 11 12
CompanyName
Customer1 17 30 29 39 15 26 24 12 36 21 18 15
Customer2 4 11 13 22 35 29 15 18 29 31 17 14
Customer3 11 8 25 24 7 15 20 0 21 12 12 17
and my code is this so far.
First I grab a year's worth of data (I would change this to a month or a week for this question)
# parse the timestamp column
df['recvd_dttm'] = pd.to_datetime(df['recvd_dttm'])
# only retrieve data before now (ignore typos that are future dates)
mask = df['recvd_dttm'] <= datetime.datetime.now()
df = df.loc[mask]
# get first and last datetime for the final year of data
range_max = df['recvd_dttm'].max()
range_min = range_max - pd.DateOffset(years=1)
# take the slice with the final year of data
df = df[(df['recvd_dttm'] >= range_min) &
        (df['recvd_dttm'] <= range_max)]
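For the shorter spans, presumably only the offset in this slice needs to change (a sketch, not tested against the original data):
# month span: same slicing logic, different offset (assumption)
range_min = range_max - pd.DateOffset(months=1)
# week span
# range_min = range_max - pd.DateOffset(weeks=1)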
Then I create the pivot_table shown above.
###########################################################
#Create Dataframe
###########################################################
df = df.set_index('recvd_dttm')
df.index = pd.to_datetime(df.index, format='%m/%d/%Y %H:%M')
result = df.groupby([lambda idx: idx.month, 'CompanyName']).agg(len).reset_index()
result.columns = ['Month', 'CompanyName', 'NumberCalls']
pivot_table = result.pivot(index='Month', columns='CompanyName', values='NumberCalls').fillna(0)
s = pivot_table.sum().sort_values(ascending=False)
pivot_table = pivot_table.loc[:, s.index[:30]]
pivot_table = pivot_table.transpose()
pivot_table = pivot_table.reset_index()
pivot_table['CompanyName'] = [str(x) for x in pivot_table['CompanyName']]
Companies = list(pivot_table['CompanyName'])
pivot_table = pivot_table.set_index('CompanyName')
pivot_table.to_csv('pivot_table.csv')
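The only part of this block that is tied to the year span is the month-based groupby (and, below, the hard-coded 1-12 column range). For a month or week span, the same pipeline should work with a day-based key; a sketch under that assumption:
# stack by day instead of month (assumption: a month/week slice was taken above)
result = df.groupby([lambda idx: idx.day, 'CompanyName']).agg(len).reset_index()
result.columns = ['Day', 'CompanyName', 'NumberCalls']
pivot_table = result.pivot(index='Day', columns='CompanyName', values='NumberCalls').fillna(0)
# for a week that crosses a month boundary, grouping on the full calendar
# date keeps the stacks in order: lambda idx: idx.date()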
Then I use the pivot table to create an OrderedDict for plotting
###########################################################
#Create OrderedDict for plotting
###########################################################
months = [pivot_table[m].astype(float).values for m in range(1, 13)]
names = ["Jan", "Feb", "Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov", "Dec"]
months_dict = OrderedDict(list(zip(names, months)))
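To avoid hard-coding twelve stacks, the dict can also be built from whatever columns the pivot table actually has; a sketch (the stack labels become day numbers rather than month names):
# generic version: one stack per pivot column, regardless of time span
stack_names = [str(c) for c in pivot_table.columns]
stacks = [pivot_table[c].astype(float).values for c in pivot_table.columns]
stacks_dict = OrderedDict(zip(stack_names, stacks))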
###########################################################
#Plot!
###########################################################
palette = brewer["RdYlGn"][8]
hover = HoverTool(
    tooltips=[
        ("Month", "@months"),
        ("Number of Calls", "@NumberCalls"),
    ]
)
output_file("stacked_bar.html")
bar = Bar(months_dict, Companies, title="Number of Calls Each Month", palette = palette, legend = "top_right", width = 1200, height=900, stacked=True)
bar.add_tools(hover)
show(bar)
Does anyone have ideas on how to approach modifying this code so it can work for shorter time spans? This is what the graph looks like for a year
EDIT Added the full code. Input looks like this example:
CompanyName recvd_dttm
Company1 6/5/2015 18:28:50 PM
Company2 6/5/2015 14:25:43 PM
Company3 9/10/2015 21:45:12 PM
Company4 6/5/2015 14:30:43 PM
Company5 6/5/2015 14:32:33 PM

Related

How to find the number of days in each month between two dates in different years

I'm trying to get the number of days between two dates, broken down per month.
I found some answers but I can't figure out how to do it when the dates have two different years.
For example, I have this dataframe:
df = {'Id': ['1','2','3','4','5'],
      'Item': ['A','B','C','D','E'],
      'StartDate': ['2019-12-10', '2019-12-01', '2019-01-01', '2019-05-10', '2019-03-10'],
      'EndDate': ['2020-01-30', '2020-02-02', '2020-03-03', '2020-03-03', '2020-02-02']
     }
df = pd.DataFrame(df, columns=['Id', 'Item', 'StartDate', 'EndDate'])
And I want to get a dataframe counting, for each row, the days that fall in each year-month (shown as the final output below).
s = (df[["StartDate", "EndDate"]]
     .apply(lambda row: pd.date_range(row.StartDate, row.EndDate), axis=1)
     .explode())
new = (s.groupby([s.index, s.dt.year, s.dt.month])
        .count()
        .unstack(level=[1, 2], fill_value=0))
new.columns = new.columns.map(lambda c: f"{c[0]}-{str(c[1]).zfill(2)}")
new = new.sort_index(axis="columns")
get all the dates between StartDate and EndDate per row, and explode that list of dates into their own rows
group by the row id, year and month & count records
unstack the year & month identifier to the columns side as a multiindex
join the year & month values with a hyphen in between (also zero-fill months, e.g., 03)
lastly, sort the year-month pairs on columns
to get
>>> new
2019-11 2019-12 2020-01 2020-02 2020-03
0 0 22 30 0 0
1 0 31 31 2 0
2 0 31 31 29 3
3 21 31 31 29 3
4 9 31 31 2 0
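As a quick sanity check (my addition, not part of the original answer), each row's counts should sum to the inclusive number of days between its StartDate and EndDate:
spans = (pd.to_datetime(df["EndDate"]) - pd.to_datetime(df["StartDate"])).dt.days + 1
assert (new.sum(axis=1) == spans).all()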

How to calculate monthly and weekly averages from a dataframe using python?

Below is my dataframe. How do I calculate both monthly and weekly averages from it in Python? I need to print the month start & end and the week start & end, then the mean value for each month and week.
Input sample dataset
kpi_id kpi_name run_date value
1 MTTR 5/17/2021 15
2 MTTR 5/18/2021 16
3 MTTR 5/19/2021 17
4 MTTR 5/20/2021 18
5 MTTR 5/21/2021 19
6 MTTR 5/22/2021 20
7 MTTR 5/23/2021 21
8 MTTR 5/24/2021 22
9 MTTR 5/25/2021 23
10 MTTR 5/26/2021 24
11 MTTR 5/27/2021 25
Expected output
monthly_mean
kpi_name month_start month_end value(mean)
MTTR 5/1/2021 5/31/2021 20
weekly_mean
kpi_name week_start week_end value(mean)
MTTR 5/17/2021 5/23/2021 18
MTTR 5/24/2021 5/30/2021 23.5
groupby is your friend
monthly = df.groupby(pd.Grouper(key='run_date', freq='M'))['value'].mean()
weekly = df.groupby(pd.Grouper(key='run_date', freq='W'))['value'].mean()
Extending the answer from igrolvr to match your expected output:
# transformation from string to datetime for groupby
df['run_date'] = pd.to_datetime(df['run_date'], format='%m/%d/%Y')
# monthly
# The groupby() will extract the last date in period,
# reset_index() will make the group by keys to columns
monthly = df.groupby(by=['kpi_name', pd.Grouper(key='run_date', freq='M')])['value'].mean().reset_index()
# Getting the start of month
monthly['month_start'] = monthly['run_date'] - pd.offsets.MonthBegin(1)
# Renaming the run_date column to month_end
monthly = monthly.rename({'run_date': 'month_end'}, axis=1)
print(monthly)
# weekly
# The groupby() will extract the last date in period,
# reset_index() will make the group by keys to columns
weekly = df.groupby(by=['kpi_name', pd.Grouper(key='run_date', freq='W')])['value'].mean().reset_index()
# Get the start of the week: freq='W' labels each week by its Sunday
# end date, so the Monday start is the end minus six days
weekly['week_start'] = weekly['run_date'] - pd.offsets.Week(1) + pd.offsets.Day(1)
# Renaming the run_date column to week_end
weekly = weekly.rename({'run_date': 'week_end'}, axis=1)
print(weekly)
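An alternative sketch using periods (my variant, not from the answer above): dt.to_period('M') carries both month boundaries directly, so no offset arithmetic is needed.
# monthly means via periods; start and end fall out of the period itself
p = df['run_date'].dt.to_period('M')
monthly_alt = df.groupby(['kpi_name', p])['value'].mean().reset_index()
monthly_alt['month_start'] = monthly_alt['run_date'].dt.start_time
monthly_alt['month_end'] = monthly_alt['run_date'].dt.end_time.dt.normalize()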

Pandas - seaborn lineplot hue unexpected legend

I have a data frame of client names, dates and transactions. I'm not sure how far back my error goes, so here is all the pre-processing I do:
data = pd.read_excel('Test.xls')
## convert to datetime object
data['Date Order'] = pd.to_datetime(data['Date Order'], format = '%d.%m.%Y')
## add columns for month and year of each row for easier analysis later
data['month'] = data['Date Order'].dt.month
data['year'] = data['Date Order'].dt.year
So the data frame becomes something like:
Date Order NameCustomers SumOrder month year
2019-01-02 00:00:00 Customer 1 290 1 2019
2019-02-02 00:00:00 Customer 1 50 2 2019
-----
2020-06-28 00:00:00 Customer 2 900 6 2020
------
..etc.
You get the idea. Next I group by both month and year and calculate the mean.
groupedMonthYearMean = data.groupby(['month', 'year'])['SumOrder'].mean().reset_index()
Output:
month year SumOrder
1 2019 233.08
1 2020 303.40
2 2019 255.34
2 2020 842.24
--------------------------
I use the resulting dataframe to make a lineplot, which tracks the SumOrder for each month, and displays it for each year.
linechart = sns.lineplot(x='month',
                         y='SumOrder',
                         hue='year',
                         data=groupedMonthYearMean).set_title('Mean Sum Order by month')
plt.show()
I have attached a screenshot of the resulting plot - overall it seems to show what I expected to create.
In my entire data, the 'year' column has only two values: 2019 and 2020. For some reason, whatever I do, they show up as 0, -1 and -2. Any ideas what is going on?
You want to change the dtype of the year column from int to category:
df['year'] = df['year'].astype('category')
This is due to how hue treats ints: a numeric hue is interpreted as continuous, so seaborn derives the legend entries from evenly spaced ticks over the normalized value range rather than from the actual years. Casting to category makes the palette and legend discrete.
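Applied to the example above, that would presumably be:
# cast year to categorical so seaborn treats hue as discrete
groupedMonthYearMean['year'] = groupedMonthYearMean['year'].astype('category')
sns.lineplot(x='month', y='SumOrder', hue='year',
             data=groupedMonthYearMean).set_title('Mean Sum Order by month')
plt.show()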

Subsetting out rows using pandas

I have two sets of dataframes: datamax, datamax2015 and datamin, datamin2015.
Snippet of data:
print(datamax.head())
print(datamin.head())
print(datamax2015.head())
print(datamin2015.head())
Date ID Element Data_Value
0 2005-01-01 USW00094889 TMAX 156
1 2005-01-02 USW00094889 TMAX 139
2 2005-01-03 USW00094889 TMAX 133
3 2005-01-04 USW00094889 TMAX 39
4 2005-01-05 USW00094889 TMAX 33
Date ID Element Data_Value
0 2005-01-01 USC00200032 TMIN -56
1 2005-01-02 USC00200032 TMIN -56
2 2005-01-03 USC00200032 TMIN 0
3 2005-01-04 USC00200032 TMIN -39
4 2005-01-05 USC00200032 TMIN -94
Date ID Element Data_Value
0 2015-01-01 USW00094889 TMAX 11
1 2015-01-02 USW00094889 TMAX 39
2 2015-01-03 USW00014853 TMAX 39
3 2015-01-04 USW00094889 TMAX 44
4 2015-01-05 USW00094889 TMAX 28
Date ID Element Data_Value
0 2015-01-01 USC00200032 TMIN -133
1 2015-01-02 USC00200032 TMIN -122
2 2015-01-03 USC00200032 TMIN -67
3 2015-01-04 USC00200032 TMIN -88
4 2015-01-05 USC00200032 TMIN -155
For datamax and datamax2015, I want to compare their Data_Value columns and build a dataframe of the entries in datamax2015 whose Data_Value is greater than that of every entry in datamax for the same day of the year. The expected output is thus a dataframe drawn from the rows between 2015-01-01 and 2015-12-31, keeping only the dates where the Data_Value exceeds the corresponding Data_Value in datamax.
i.e. 4 columns and anywhere from 1 to 364 rows, depending on the condition above.
I want the converse (min) for the datamin and datamin2015 dataframes.
I have tried the following code:
upper = []
for row in datamax.iterrows():
    for j in datamax2015["Data_Value"]:
        if j > row["Data_Value"]:
            upper.append(row)

lower = []
for row in datamin.iterrows():
    for j in datamin2015["Data_Value"]:
        if j < row["Data_Value"]:
            lower.append(row)
Could anyone give me a helping hand as to where I am going wrong?
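(A side note, not from the answer below: iterrows() yields (index, Series) tuples, so row["Data_Value"] in the loops above raises a TypeError before any comparison runs; the tuple would need unpacking.)
# iterrows() yields (index, Series) pairs, so the row must be unpacked first
for _, row in datamax.iterrows():
    value = row["Data_Value"]  # indexing works once the Series is unpacked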
This code does what you want for the datamin. Try to adapt it to the symmetric datamax case as well - leave a comment if you have trouble and I'll be happy to help further.
Create Data
from datetime import datetime
import pandas as pd
datamin = pd.DataFrame({"date": pd.date_range(start=datetime(2005, 1, 1), end=datetime(2015, 12, 31)), "Data_Value": 1})
datamin["day_of_year"] = datamin["date"].dt.dayofyear
# Set the value for the 4th day of the year higher in order for the desired result to be non-empty
datamin.loc[datamin["day_of_year"]==4, "Data_Value"] = 2
datamin2015 = pd.DataFrame({"date": pd.date_range(start=datetime(2015, 1, 1), end=datetime(2015, 12, 31)), "Data_Value": 2})
datamin2015["day_of_year"] = datamin["date"].dt.dayofyear
# Set the value for the 4th day of the year lower in order for the desired result to be non-empty
datamin2015.loc[3, "Data_Value"] = 1
The solution
df1 = datamin.groupby("day_of_year").agg({"Data_Value": "min"})
df2 = datamin2015.join(df1, on="day_of_year", how="left", lsuffix="2015")
lower = df2.loc[df2["Data_Value2015"]<df2["Data_Value"]]
lower
We group the datamin by day of year to find the min across all the years for each day of the year (using .dt.dayofyear). Then we join that with datamin2015 and finally can then compare the Data_Value2015 with Data_Value in order to find the indexes of the rows where the Data_Value in 2015 was less than the minimum across all same days of the year in datamin.
In the example above lower has 1 row by the way I set up the dataframes.
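For reference, the symmetric datamax case would presumably be:
# max across all years per day of year, then keep 2015 rows that exceed it
df1 = datamax.groupby("day_of_year").agg({"Data_Value": "max"})
df2 = datamax2015.join(df1, on="day_of_year", how="left", lsuffix="2015")
upper = df2.loc[df2["Data_Value2015"] > df2["Data_Value"]]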
I need Python code which returns a line graph of the record high and record low temperatures by day of the year over the period 2005-2014. The area between the record high and record low temperatures for each day should be shaded.
Overlay a scatter of the 2015 data for any points (highs and lows) for which the ten year record (2005-2014) record high or record low was broken in 2015.
Remove leap year dates (i.e. 29th February).
from datetime import datetime
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option("display.max_rows",None,"display.max_columns",None)
data = pd.read_csv('data/C2A2_data/BinnedCsvs_d400/fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv')
newdata = data[(data['Date'] >= '2005-01-01') & (data['Date'] <= '2014-12-31')]
datamax = newdata[newdata['Element']=='TMAX']
datamin = newdata[newdata['Element']=='TMIN']
datamax['Date'] = pd.to_datetime(datamax['Date'])
datamin['Date'] = pd.to_datetime(datamin['Date'])
datamax["day_of_year"] = datamax["Date"].dt.dayofyear
datamax = datamax.groupby('day_of_year').max()
datamin["day_of_year"] = datamin["Date"].dt.dayofyear
datamin = datamin.groupby('day_of_year').min()
datamax = datamax.reset_index()
datamin = datamin.reset_index()
datamin['Date'] = datamin['Date'].dt.strftime('%Y-%m-%d')
datamax['Date'] = datamax['Date'].dt.strftime('%Y-%m-%d')
datamax = datamax[~datamax['Date'].str.contains("02-29")]
datamin = datamin[~datamin['Date'].str.contains("02-29")]
breakoutdata = data[(data['Date'] > '2014-12-31')]
datamax2015 = breakoutdata[breakoutdata['Element']=='TMAX']
datamin2015 = breakoutdata[breakoutdata['Element']=='TMIN']
datamax2015['Date'] = pd.to_datetime(datamax2015['Date'])
datamin2015['Date'] = pd.to_datetime(datamin2015['Date'])
datamax2015["day_of_year"] = datamax2015["Date"].dt.dayofyear
datamax2015 = datamax2015.groupby('day_of_year').max()
datamin2015["day_of_year"] = datamin2015["Date"].dt.dayofyear
datamin2015 = datamin2015.groupby('day_of_year').min()
datamax2015 = datamax2015.reset_index()
datamin2015 = datamin2015.reset_index()
datamin2015['Date'] = datamin2015['Date'].dt.strftime('%Y-%m-%d')
datamax2015['Date'] = datamax2015['Date'].dt.strftime('%Y-%m-%d')
datamax2015 = datamax2015[~datamax2015['Date'].str.contains("02-29")]
datamin2015 = datamin2015[~datamin2015['Date'].str.contains("02-29")]
dataminappend = datamin2015.join(datamin,on="day_of_year",rsuffix="_new")
lower = dataminappend.loc[dataminappend["Data_Value_new"]>dataminappend["Data_Value"]]
datamaxappend = datamax2015.join(datamax,on="day_of_year",rsuffix="_new")
upper = datamaxappend.loc[datamaxappend["Data_Value_new"]<datamaxappend["Data_Value"]]
upper['Date'] = pd.to_datetime(upper['Date'])
lower['Date'] = pd.to_datetime(lower['Date'])
datamax['Date'] = pd.to_datetime(datamax['Date'])
datamin['Date'] = pd.to_datetime(datamin['Date'])
ax = plt.gca()
plt.plot(datamax['day_of_year'],datamax['Data_Value'],color='red')
plt.plot(datamin['day_of_year'],datamin['Data_Value'], color='blue')
plt.scatter(upper['day_of_year'],upper['Data_Value'],color='purple')
plt.scatter(lower['day_of_year'],lower['Data_Value'], color='cyan')
plt.ylabel("Temperature (degrees C)",color='navy')
plt.xlabel("Date",color='navy',labelpad=15)
plt.title('Record high and low temperatures by day (2005-2014)', alpha=1.0,color='brown',y=1.08)
ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.35),fancybox=False,labels=['Record high','Record low'])
plt.xticks(rotation=30)
plt.fill_between(range(len(datamax['Date'])), datamax['Data_Value'], datamin['Data_Value'],color='yellow',alpha=0.8)
plt.show()
I converted the 'Date' column to a string using datamin['Date'] = datamin['Date'].dt.strftime('%Y-%m-%d').
I then converted this back to datetime format using upper['Date'] = pd.to_datetime(upper['Date']).
I then used 'day_of_year' as the x-value.

Extracting date components in pandas series

I have problems with transforming a Pandas dataframe column with dates to a number.
import matplotlib.dates
import datetime
for x in arsenalchelsea['Datum']:
    year = int(x[:4])
    month = int(x[5:7])
    day = int(x[8:10])
    hour = int(x[11:13])
    minute = int(x[14:16])
    sec = int(x[17:19])
    arsenalchelsea['floatdate'] = date2num(datetime.datetime(year, month, day, hour, minute, sec))
arsenalchelsea
I want to make a new column in my dataframe with the dates as numbers, because I want to make a line graph later with the date on the x-axis.
This is the format of the date:
2017-11-29 14:06:45
Does anyone have a solution for this problem?
Slicing strings to get date components is bad practice. You should convert to datetime and extract directly.
In this case, it seems you can just use pd.to_datetime, but below I also demonstrate how you can extract the various components once you have performed the conversion.
df = pd.DataFrame({'Date': ['2017-01-15 14:55:42', '2017-11-10 12:15:21', '2017-12-05 22:05:45']})
df['Date'] = pd.to_datetime(df['Date'])
df[['year', 'month', 'day', 'hour', 'minute', 'sec']] = \
    df['Date'].apply(lambda x: (x.year, x.month, x.day, x.hour, x.minute, x.second)).apply(pd.Series)
Result:
Date year month day hour minute sec
0 2017-01-15 14:55:42 2017 1 15 14 55 42
1 2017-11-10 12:15:21 2017 11 10 12 15 21
2 2017-12-05 22:05:45 2017 12 5 22 5 45
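On larger frames, the vectorized .dt accessor is presumably preferable to the row-wise apply:
# same columns via the vectorized .dt accessor
df['year'] = df['Date'].dt.year
df['month'] = df['Date'].dt.month
df['day'] = df['Date'].dt.day
df['hour'] = df['Date'].dt.hour
df['minute'] = df['Date'].dt.minute
df['sec'] = df['Date'].dt.second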
