Grouped dates are 30 days behind for Plotly Express line graphs - python

I have a list of daily transactions that I am trying to plot on a line graph. I decided to group by month and year and sum each grouping. The data plots on the Plotly line graph as expected, except that the dates are 30 days behind. This makes it difficult to add/subtract dates to obtain a certain date range.
To get a date range, I am currently not using the grouped dates but the original dates, applying relativedelta to them. How can I resolve this?
import pandas as pd
from datetime import datetime, timedelta
import plotly.express as px
import sqlite3
import numpy as np
from dateutil.relativedelta import relativedelta
data = {
'Transaction_type':[ 'Debit', 'Debit', 'Credit','Debit','Debit','Debit', 'Debit', 'Credit','Debit','Debit'],
'Amount': [40,150,1000,60,80,120, 80, 1000,500,80]
}
df = pd.DataFrame(data)
df['Date'] = pd.date_range(start='6/1/2022',end='7/30/2022', periods = len(df))
df['Date'] = pd.to_datetime(df['Date'])
df['year_month'] = df['Date'].dt.strftime('%Y-%m')
#Income Expense Visual
Income_Expense = df.copy()
Income_Expense.Transaction_type.replace(['Credit'], 'Income', inplace=True)  # Change to Income for line legend
Income_Expense.Transaction_type.replace(['Debit'], 'Expense', inplace=True)  # Change to Expense for line legend
Income_Expense = pd.pivot_table(Income_Expense, values=['Amount'], index=['Transaction_type', 'year_month'], aggfunc='sum').reset_index()
scatter_plot = px.line(Income_Expense, x='year_month', y='Amount', color='Transaction_type', title='Income and Expense', color_discrete_sequence=['red', 'green'],
                       category_orders={'Transaction_type': ['Expense', 'Income']})
scatter_plot.update_layout(legend_traceorder = 'reversed')
scatter_plot.update_layout(yaxis_tickformat = ',')
scatter_plot.show()

The reason for the error is strftime(). It converts your dates to strings, and from that point onwards Plotly treats each date as a plain string, so the axis is categorical rather than a true date axis. You can run Income_Expense.info() to check.
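For instance, a quick dtype check on the grouped frame from the question (a minimal illustration, using only the code above) makes this visible:
Income_Expense.dtypes
# year_month shows up as object (i.e. string), not datetime64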
So, you need to leave the dates in the datetime format. pandas Grouper can be used to group the dates by monthly frequency. You can then plot it and specify the date format so that plotly understands that these are dates. Below is the updated code.
Note that Date needs to be in the index for Grouper to work. So, I first move it there with set_index(), then group with a monthly frequency alongside Transaction_type, then take a sum and reset_index(). This creates a dataframe that looks like the one you had, except that the dates are now datetimes, not strings.
import pandas as pd
from datetime import datetime, timedelta
import plotly.express as px
import sqlite3
import numpy as np
from dateutil.relativedelta import relativedelta
data = {'Transaction_type':[ 'Debit', 'Debit', 'Credit','Debit','Debit','Debit', 'Debit', 'Credit','Debit','Debit'], 'Amount': [40,150,1000,60,80,120, 80, 1000,500,80]}
df = pd.DataFrame(data)
df['Date'] = pd.date_range(start='6/1/2022',end='7/30/2022', periods = len(df))
df['Date'] = pd.to_datetime(df['Date'])
df['year_month'] = df['Date'].dt.strftime('%Y-%m')
#Income Expense Visual
Income_Expense = df.copy()
Income_Expense.Transaction_type.replace(['Credit'], 'Income', inplace=True)  # Change to Income for line legend
Income_Expense.Transaction_type.replace(['Debit'], 'Expense', inplace=True)  # Change to Expense for line legend
Income_Expense = Income_Expense.set_index('Date').groupby([pd.Grouper(freq="M"), 'Transaction_type'])['Amount'].sum().reset_index()
scatter_plot = px.line(Income_Expense, x='Date', y='Amount', color='Transaction_type', title='Income and Expense', color_discrete_sequence=['red', 'green'],
                       category_orders={'Transaction_type': ['Expense', 'Income']})
scatter_plot.update_layout(legend_traceorder = 'reversed')
scatter_plot.update_layout(yaxis_tickformat = ',')
scatter_plot.update_xaxes(tickformat="%d-%b-%Y")
scatter_plot.show()
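Since Date is now a real datetime column, the add/subtract use case from the question works directly on the grouped frame. A minimal sketch, assuming you want to keep only the last 12 months relative to the newest grouped date (the names come from the code above):
# keep only rows within the last 12 months of the grouped data
cutoff = Income_Expense['Date'].max() - relativedelta(months=12)
last_year = Income_Expense[Income_Expense['Date'] >= cutoff]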

Related

Get the first and the last day of a month from the df

This is what my dataframe looks like:
datetime open high low close
2006-01-02 4566.95 4601.35 4542.00 4556.25
2006-01-03 4531.45 4605.45 4531.45 4600.25
2006-01-04 4619.55 4707.60 4616.05 4694.14
...
Need to calculate the Monthly Returns in %
Formula: (Month Closing Price - Month Open Price) / Month Open Price
I can't seem to get the open and closing price of a month, because most months in my df don't have a row for the 1st of the month, so I am having trouble calculating it.
Any help would be very much appreciated!
You need to use groupby and the agg function in order to get the first and last value of each column in each month:
import pandas as pd
df = pd.read_csv("dt.txt")
df["datetime"] = pd.to_datetime(df["datetime"])
df.set_index("datetime", inplace=True)
# group by (year, month) and keep the first and last value of every column
resultDf = df.groupby([df.index.year, df.index.month]).agg(["first", "last"])
# monthly return: (last close of the month - first open of the month) / first open
resultDf["new_column"] = (resultDf[("close", "last")] - resultDf[("open", "first")]) / resultDf[("open", "first")]
resultDf.index.rename(["year", "month"], inplace=True)
resultDf.reset_index(inplace=True)
resultDf
The code above will result in a dataframe that has multiindex column. So, if you want to get, for example, rows with year of 2010, you can do something like:
resultDf[resultDf["year"] == 2010]
You can create a custom grouper as follows:
import pandas as pd
import numpy as np
from io import StringIO
csvfile = StringIO(
"""datetime\topen\thigh\tlow\tclose
2006-01-02\t4566.95\t4601.35\t4542.00\t4556.25
2006-01-03\t4531.45\t4605.45\t4531.45\t4600.25
2006-01-04\t4619.55\t4707.60\t4616.05\t4694.14""")
df = pd.read_csv(csvfile, sep = '\t', engine='python')
df.datetime = pd.to_datetime(df.datetime, format = "%Y-%m-%d")
dg = df.groupby(pd.Grouper(key='datetime', axis=0, freq='M'))
Then each group of dg is separated by month, and since we converted datetime to a pandas datetime we can use classic arithmetic on it:
def monthly_return(datetime, close_value, open_value):
    # positions of the earliest and latest dates within the month
    index_start = np.argmin(datetime)
    index_end = np.argmax(datetime)
    return (close_value.iloc[index_end] - open_value.iloc[index_start]) / open_value.iloc[index_start]
dg.apply(lambda x : monthly_return(x.datetime, x.close, x.open))
Out[97]:
datetime
2006-01-31 0.02785
Freq: M, dtype: float64
Of course a pure functional approach is possible instead of defining the monthly_return function.
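An alternative sketch with less code, assuming the same df as above: resample by month-end and aggregate the first open and last close directly (resample and the dict form of agg are standard pandas; df_m is my own name).
df_m = df.set_index('datetime').resample('M').agg({'open': 'first', 'close': 'last'})
df_m['monthly_return'] = (df_m['close'] - df_m['open']) / df_m['open']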

manipulate Date from yfinance

When I pull stock data from yfinance, can I create other columns of data that manipulate the 'date' column? I am new to python and still learning a lot. I have created other columns using the stock price data, but I cannot figure out how to manipulate the 'date' column.
For example, for 10/26/2020 I would like to create columns with the following data:
day_of_week, Monday = 1
year = 2020
month = 10
day = 26
week = 44
trade_day = 207
import pandas as pd
import numpy as np
import yfinance as yf
import pandas_datareader as pdr
import datetime as dt
import matplotlib.pyplot as plt
##Get stock price data
ticker = 'NVDA'
#Data time period
now = dt.datetime.now()
startyear = 2017
startmonth=1
startday=1
start = dt.datetime(startyear, startmonth, startday)
#get data from YFinance
df = pdr.get_data_yahoo(ticker, start, now)
#create a column
df['% Change'] = (df['Adj Close'] / df['Adj Close'].shift(1))-1
df['Range'] = df['High'] - df['Low']
df
You want to use the index of your dataframe, which is of type pd.DatetimeIndex.
To split the date into new columns:
new_df = df.copy()
new_df['year'], new_df['month'], new_df['day'] = df.index.year, df.index.month, df.index.day
To carry out arithmetic operations relative to the first trade date:
start_date = df.index.min()
new_df['trade_day'] = (df.index - start_date).days  # calendar days since the first trade date
new_df['trade_week'] = df.index.isocalendar().week.astype(int) - start_date.isocalendar()[1]  # ISO week offset
new_df['trade_year'] = df.index.year - start_date.year
new_df['day_of_week'] = df.index.weekday
new_df['days_in_month'] = df.index.days_in_month
new_df['day_name'] = df.index.day_name()
new_df['month_name'] = df.index.month_name()
To choose another start date:
start_date = pd.to_datetime('2017-01-01')
I did figure out most of the problem, but I cannot figure out how to calculate the 'trade day'.
#Convert the 'Date' Index to 'Date' Column
df.reset_index(inplace=True)
#Create columns manipulating 'Date'
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df['Week of Year'] = df['Date'].dt.isocalendar().week
df['Day of Week'] = df['Date'].dt.dayofweek
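One piece still missing above is trade_day (the 207 in the example). A hedged reading is "the Nth trading session of the year", which is just a running count within each year; a minimal sketch under that assumption, reusing the Date column created above ('Trade Day' is my own column name):
# trade day = running count of trading sessions within each calendar year
df['Trade Day'] = df.groupby(df['Date'].dt.year).cumcount() + 1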

Adding rows to pandas dataframe with date range, created_at and today, python

I have a dataframe consisting of two columns: customer_id and a date column, created_at.
I wish to add another row for each month the customer remains in the customer base.
For example, if the customer_id was created during July, the dataframe would add 4 additional rows for that customer, covering the range between "created_at" and "today". For the sample below, customer1 would have 9 rows, one for each month up to today, customer2 would have 7, and customer3 would have 4. I was thinking of maybe something like I've copied below, with the idea of merging df with seqDates...
import pandas as pd
import numpy as np
df = pd.DataFrame([("customer1", "05-02-2020"), ("customer2","05-04-2020"), ("customer3","04-07-2020")], index=["1","2","3"], columns= ("customer_id","created_at"))
df["created_at"] = pd.to_datetime(df["created_at"])
# create month expansion column
start = min(df["created_at"])
end = pd.to_datetime("today")
seqDates = pd.date_range(start, end, freq="D")
seqDates = pd.DataFrame(seqDates)
columns = ["created_at"]
Try this:
import pandas as pd
import datetime
from dateutil.relativedelta import relativedelta
from dateutil import rrule, parser
outList = []
operations_date = datetime.datetime.now().date()
dfDict = df.to_dict(orient='records')
for aDict in dfDict:
    # created_at is already a Timestamp after the pd.to_datetime conversion above
    created_at = aDict['created_at']
    start_date = created_at.date() - relativedelta(months=1)
    end_date = parser.parse(str(operations_date))
    # the 1st of each month, from the creation month through today
    date_range = list(rrule.rrule(rrule.MONTHLY, bymonthday=1, dtstart=start_date, until=end_date))
    for aDate in date_range:
        outList.append({'customer_id': aDict['customer_id'], 'created_at': aDate})
df = pd.DataFrame(outList)
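A shorter pandas-only alternative sketch, assuming the original df from the question (created_at already converted with pd.to_datetime) and one row per calendar month from the creation month through today (date_range and explode are standard pandas; the month column name is my own):
today = pd.Timestamp('today')
# one month-start timestamp per month, from the creation month through today
df['month'] = df['created_at'].apply(lambda d: pd.date_range(d.replace(day=1), today, freq='MS'))
monthly = df.explode('month')[['customer_id', 'month']]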

Python: Time Series with Pandas

I want to use time series with Pandas. I read multiple time series, one by one, from a CSV file which has the date in a column named "Date" in YYYY-MM-DD format:
Date,Business,Education,Holiday
2005-01-01,6665,8511,86397
2005-02-01,8910,12043,92453
2005-03-01,8834,12720,78846
2005-04-01,8127,11667,52644
2005-05-01,7762,11092,33789
2005-06-01,7652,10898,34245
2005-07-01,7403,12787,42020
2005-08-01,7968,13235,36190
2005-09-01,8345,12141,36038
2005-10-01,8553,12067,41089
2005-11-01,8880,11603,59415
2005-12-01,8331,9175,70736
df = pd.read_csv(csv_file, index_col = 'Date',header=0)
Series_list = df.keys()
The time series can have different frequencies: day, week, month, quarter, or year. I want to index the time series according to a frequency I decide on before I generate the ARIMA model. Could someone please explain how I can define the frequency of the series?
stepwise_fit = auto_arima(df[Series_name]....
pandas has a built-in function, pandas.infer_freq():
import pandas as pd
df = pd.DataFrame({'Date': ['2005-01-01', '2005-02-01', '2005-03-01', '2005-04-01'],
                   'Date1': ['2005-01-01', '2005-01-02', '2005-01-03', '2005-01-04'],
                   'Date2': ['2006-01-01', '2007-01-01', '2008-01-01', '2009-01-01'],
                   'Date3': ['2006-01-01', '2006-02-06', '2006-03-11', '2006-04-01']})
df['Date'] = pd.to_datetime(df['Date'])
df['Date1'] = pd.to_datetime(df['Date1'])
df['Date2'] = pd.to_datetime(df['Date2'])
df['Date3'] = pd.to_datetime(df['Date3'])
pd.infer_freq(df.Date)
#'MS'
pd.infer_freq(df.Date1)
#'D'
pd.infer_freq(df.Date2)
#'AS-JAN'
Alternatively, you could make use of the datetime accessor of the columns.
df.Date.dt.freq
#'MS'
Of course, if your data doesn't actually have a regular frequency, you'll get None back.
pd.infer_freq(df.Date3)
# None
The frequency descriptions are documented under offset aliases in the pandas docs.
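To tie this back to the ARIMA question: once the frequency is known (or chosen), it can be stamped onto the index with asfreq before fitting. A minimal sketch, assuming the CSV and the csv_file variable from the question:
df = pd.read_csv(csv_file, index_col='Date', parse_dates=True)
df = df.asfreq(pd.infer_freq(df.index))  # or e.g. df.asfreq('MS') to impose a frequency explicitly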

Pivot pandas timeseries by year

Is there a shorter or more elegant way to pivot a timeseries by year in pandas? The code below does what I want, but I wonder if there is a better way to accomplish it:
import pandas
import numpy
daterange = pandas.date_range(start='2000-01-01', end='2017-12-31', freq='10T')
# generate a fake timeseries of measured wind speeds from 2000 to 2017 in 10min intervals
wind_speed = pandas.Series(data=numpy.random.rand(daterange.size), index=daterange)
# group by year
wind_speed_groups = wind_speed.groupby(wind_speed.index.year).groups
# assemble data frame with columns of wind speed data for every year
wind_speed_pivot = pandas.DataFrame()
for key, group in wind_speed_groups.items():
    series = wind_speed[group]
    series.name = key
    series.index = series.index - pandas.Timestamp(str(key) + '-01-01')
    wind_speed_pivot = wind_speed_pivot.join(series, how='outer')
print(wind_speed_pivot)
I'm not sure if this is the fastest method, as I'm adding two columns to your initial dataframe (it's possible to add just one if you want to overwrite it).
import pandas as pd
import numpy as np
import datetime as dt
daterange = pd.date_range(start='2000-01-01', end='2017-12-31', freq='10T')
# generate a fake timeseries of measured wind speeds from 2000 to 2017 in 10min intervals
wind_speed = pd.Series(data=np.random.rand(daterange.size), index=daterange)
df = wind_speed.to_frame("windspeed")
df["year"] = df.index.year
df["pv_index"] = df.index - df["year"].apply(lambda x: dt.datetime(x,1,1))
wind_speed_pivot = df.pivot_table(index=["pv_index"], columns=["year"], values=["windspeed"])
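A variant sketch that avoids the per-row lambda by deriving the within-year offset straight from the index (to_period and to_timestamp are standard pandas; the column names are my own):
offset = wind_speed.index - wind_speed.index.to_period('Y').to_timestamp()
wind_speed_pivot = (wind_speed.to_frame('windspeed')
                    .assign(year=wind_speed.index.year, pv_index=offset)
                    .pivot_table(index='pv_index', columns='year', values='windspeed'))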
