When I pull stock data from yfinance, can I create other columns of data derived from the 'date' column? I am new to Python and still learning a lot. I have created other columns using the stock price data, but I cannot figure out how to manipulate the 'date' column.
For example, for 10/26/2020, I would like to create columns with the following data:
day_of_week, Monday = 1
year = 2020
month = 10
day = 26
week = 44
trade_day = 207
import pandas as pd
import numpy as np
import yfinance as yf
import pandas_datareader as pdr
import datetime as dt
import matplotlib.pyplot as plt
##Get stock price data
ticker = 'NVDA'
#Data time period
now = dt.datetime.now()
startyear = 2017
startmonth=1
startday=1
start = dt.datetime(startyear, startmonth, startday)
#get data from YFinance
df = pdr.get_data_yahoo(ticker, start, now)
#create a column
df['% Change'] = (df['Adj Close'] / df['Adj Close'].shift(1))-1
df['Range'] = df['High'] - df['Low']
df
You want to use the index of your dataframe, which is of type pd.DatetimeIndex.
To split the date into new columns:
new_df = df.copy()
new_df['year'], new_df['month'], new_df['day'] = df.index.year, df.index.month, df.index.day
To carry out arithmetic operations relative to the first trade date:
start_date = df.index.min()
new_df['trade_day'] = (df.index - start_date).days  # calendar days since the first trade date
new_df['trade_week'] = df.index.isocalendar().week.astype(int) - start_date.week
new_df['trade_year'] = df.index.year - start_date.year
new_df['day_of_week'] = df.index.weekday
new_df['days_in_month'] = df.index.days_in_month
new_df['day_name'] = df.index.day_name()
new_df['month_name'] = df.index.month_name()
To choose a different start date:
start_date = pd.to_datetime('2017-01-01')
I did figure out most of the problem, but I still cannot figure out how to calculate the trade day.
#Convert the 'Date' Index to 'Date' Column
df.reset_index(inplace=True)
#Create columns manipulating 'Date'
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df['Week of Year'] = df['Date'].dt.isocalendar().week
df['Day of Week'] = df['Date'].dt.dayofweek
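For the trade-day number, here is a minimal sketch, assuming trade_day means the running count of trading sessions within each calendar year (which would put 10/26/2020 at roughly 207):
#Running count of rows per calendar year; cumcount() is zero-based, so add 1
df['Trade Day'] = df.groupby(df['Date'].dt.year).cumcount() + 1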
I have a list of daily transactions that I am trying to plot on a line graph. I decided to group the transactions by month and year and sum those groupings. The data plots on the Plotly line graph as expected, except the end dates are 30 days behind. This makes it difficult when I want to add/subtract dates to obtain a certain date range.
To get a certain date range, I am not using the grouped dates but the original dates with relativedelta applied. How can I resolve this?
import pandas as pd
from datetime import datetime, timedelta
import plotly.express as px
import sqlite3
import numpy as np
from dateutil.relativedelta import relativedelta
data = {
    'Transaction_type': ['Debit', 'Debit', 'Credit', 'Debit', 'Debit', 'Debit', 'Debit', 'Credit', 'Debit', 'Debit'],
    'Amount': [40, 150, 1000, 60, 80, 120, 80, 1000, 500, 80]
}
df = pd.DataFrame(data)
df['Date'] = pd.date_range(start='6/1/2022',end='7/30/2022', periods = len(df))
df['Date'] = pd.to_datetime(df['Date'])
df['year_month'] = df['Date'].dt.strftime('%Y-%m')
#Income Expense Visual
Income_Expense = df.copy()
Income_Expense['Transaction_type'] = Income_Expense['Transaction_type'].replace('Credit', 'Income')   #Change to Income for line legend
Income_Expense['Transaction_type'] = Income_Expense['Transaction_type'].replace('Debit', 'Expense')   #Change to Expense for line legend
Income_Expense = pd.pivot_table(Income_Expense, values = ['Amount'], index = ['Transaction_type', 'year_month'],aggfunc=sum).reset_index()
scatter_plot = px.line(Income_Expense, x = 'year_month', y = 'Amount', color = 'Transaction_type', title = 'Income and Expense', color_discrete_sequence= ['red','green'],
category_orders= {'Cash Flow': ['Expense', 'Income']})
scatter_plot.update_layout(legend_traceorder = 'reversed')
scatter_plot.update_layout(yaxis_tickformat = ',')
scatter_plot.show()
The reason for the issue is strftime(): it converts your dates to strings, and from that point onwards Plotly treats each date as a string, so the axis values are not what you want. You can run Income_Expense.info() to check.
So you need to leave the dates in datetime format. pandas' Grouper can be used to group the dates by monthly frequency. You can then plot them and specify the date format so that Plotly understands these are dates. Below is the updated code.
Note that Date needs to be in the index for Grouper to work. So I first do this with set_index(), then group with a monthly Grouper along Transaction_type, then sum and reset_index(). This creates a dataframe that looks like the one you had, except the dates are now datetime values, not strings.
import pandas as pd
from datetime import datetime, timedelta
import plotly.express as px
import sqlite3
import numpy as np
from dateutil.relativedelta import relativedelta
data = {'Transaction_type':[ 'Debit', 'Debit', 'Credit','Debit','Debit','Debit', 'Debit', 'Credit','Debit','Debit'], 'Amount': [40,150,1000,60,80,120, 80, 1000,500,80]}
df = pd.DataFrame(data)
df['Date'] = pd.date_range(start='6/1/2022',end='7/30/2022', periods = len(df))
df['Date'] = pd.to_datetime(df['Date'])
df['year_month'] = df['Date'].dt.strftime('%Y-%m')
#Income Expense Visual
Income_Expense = df.copy()
#Rename the transaction types for the line legend
Income_Expense['Transaction_type'] = Income_Expense['Transaction_type'].replace({'Credit': 'Income', 'Debit': 'Expense'})
#Date must be the index for Grouper; group monthly along Transaction_type and sum the amounts
Income_Expense = Income_Expense.set_index('Date').groupby([pd.Grouper(freq="M"), 'Transaction_type'])['Amount'].sum().reset_index()
scatter_plot = px.line(Income_Expense, x = 'Date', y = 'Amount', color = 'Transaction_type', title = 'Income and Expense', color_discrete_sequence= ['red','green'],
                       category_orders= {'Transaction_type': ['Expense', 'Income']})
scatter_plot.update_layout(legend_traceorder = 'reversed')
scatter_plot.update_layout(yaxis_tickformat = ',')
scatter_plot.update_xaxes(tickformat="%d-%b-%Y")
scatter_plot.show()
Could you please help me with the following task?
I need to remove the weekend days from the dataframe (attached link: dataframe_running_example). I can get a list of all the weekend days between the min and max dates pulled from the data, but I cannot filter the df based on the "list_excluded" list.
from datetime import timedelta, date
import pandas as pd
#Data Loading
df= pd.read_csv("running-example.csv", delimiter=";")
df["timestamp"] = pd.to_datetime(df["timestamp"])
df["timestamp_date"] = df["timestamp"].dt.date
def daterange(date1, date2):
    for n in range(int((date2 - date1).days) + 1):
        yield date1 + timedelta(n)
#start_dt & end_dt
start_dt = df["timestamp"].min()
end_dt = df["timestamp"].max()
print("Start_dt: {} & end_dt: {}".format(start_dt, end_dt))
weekdays = [6, 7]  #isoweekday values for Saturday and Sunday
#List comprehension
list_excluded = [dt for dt in daterange(start_dt, end_dt) if dt.isoweekday() in weekdays]
df.info()
df_excluded = pd.DataFrame(list_excluded).rename({0: 'timestamp_excluded'}, axis='columns')
df_excluded["ts_excluded"] = df_excluded["timestamp_excluded"].dt.date
df[~df["timestamp_date"].isin(df_excluded["ts_excluded"])]
Ah, the issue has been resolved: I used the pd.bdate_range() function.
from datetime import timedelta, date
import pandas as pd
import numpy as np
#Load the data
df= pd.read_csv("running-example.csv", delimiter=";")
df["timestamp"] = pd.to_datetime(df["timestamp"])
df["timestamp_date"] = df["timestamp"].dt.date
#Timestamp range: start_dt & end_dt
start_dt = df["timestamp"].min()
end_dt = df["timestamp"].max()
print("Start_dt: {} & end_dt: {}".format(start_dt, end_dt))
bus_days = pd.bdate_range(start_dt, end_dt)
df["timestamp_date"] = pd.to_datetime(df["timestamp_date"])
df['Is_Business_Day'] = df['timestamp_date'].isin(bus_days)
df[df["Is_Business_Day"]!=False]
I have a dataframe consisting of two columns, customer_id and a date column, created_at.
I wish to add another row for each month the customer remains in the customer base.
For example, if the customer_id was created during July, the dataframe would add 4 additional rows for that customer, covering the range between "created_at" and "today". So customer1 would have 9 rows, one for each month up to today, customer2 would have 7 rows, and customer3 would have 4 rows. I was thinking of something like the code below, with the idea of merging df with seqDates...
import pandas as pd
import numpy as np
df = pd.DataFrame([("customer1", "05-02-2020"), ("customer2","05-04-2020"), ("customer3","04-07-2020")], index=["1","2","3"], columns= ("customer_id","created_at"))
df["created_at"] = pd.to_datetime(df["created_at"])
# create month expansion column
start = min(df["created_at"])
end = pd.to_datetime("today")
seqDates = pd.date_range(start, end, freq="D")
seqDates = pd.DataFrame(seqDates)
columns = ["created_at"]
Try this:
import pandas as pd
import datetime
from dateutil.relativedelta import relativedelta
from dateutil import rrule, parser
outList = []
operations_date = datetime.datetime.now().date()
dfDict = df.to_dict(orient='records')
for aDict in dfDict:
    created_at = aDict['created_at']
    # created_at is already a Timestamp (converted above); step back one month so that
    # the creation month itself is picked up by bymonthday=1 below
    start_date = created_at.date() - relativedelta(months=1)
    end_date = parser.parse(str(operations_date))
    date_range = list(rrule.rrule(rrule.MONTHLY, bymonthday=1, dtstart=start_date,
                                  until=end_date))
    for aDate in date_range:
        outList.append({'customer_id': aDict['customer_id'], 'created_at': aDate})
df = pd.DataFrame(outList)
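If you prefer to stay in pandas, here is a minimal sketch of the same expansion, starting from the original df built in the question (customer_id plus a datetime created_at column) rather than from the loop above; expanded is just an illustrative name:
import pandas as pd

today = pd.Timestamp.today().normalize()
#Build one list of month-start dates per customer, then explode into one row per month
expanded = df[['customer_id', 'created_at']].copy()
expanded['created_at'] = [list(pd.date_range(d.replace(day=1), today, freq='MS')) for d in expanded['created_at']]
expanded = expanded.explode('created_at').reset_index(drop=True)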
import pandas as pd
import pandas_datareader.data as web
from datetime import datetime
start_date = '2019-11-26'
end_date = str(datetime.now().strftime('%Y-%m-%d'))
tickers = ['IBM', 'AAPL','GOOG']
df = pd.concat([web.DataReader(ticker, 'yahoo', start_date, end_date) for ticker in tickers]).reset_index()
with pd.option_context('display.max_columns', 999):
print(df)
When I run my code, I can only see the "Date High Low Open Close Volume Adj Close" values.
What I want to see is each stock's name before the Date!
Please help me out...
DataReader always returns the data without the ticker name, so you have to add a name column before you concatenate.
tickers = ['IBM', 'AAPL','GOOG']
data = []
for ticker in tickers:
    df = web.DataReader(ticker, 'yahoo', start_date, end_date)
    df['Name'] = ticker  # tag each frame with its ticker before concatenating
    data.append(df)
df = pd.concat(data).reset_index()
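An alternative sketch that achieves the same thing with pd.concat's keys argument, so the ticker becomes an index level that can then be reset into a column:
frames = [web.DataReader(ticker, 'yahoo', start_date, end_date) for ticker in tickers]
#keys= adds an outer index level holding the ticker; name the levels, then turn them into columns
df = pd.concat(frames, keys=tickers, names=['Name', 'Date']).reset_index()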
import pandas as pd
import numpy as np
from nsepy import get_history
import datetime as dt
start = dt.datetime(2015, 1, 1)
end = dt.datetime.today()
infy = get_history(symbol='INFY', start = start, end = end)
infy.index = pd.to_datetime(infy.index)
infy.head()
infy_volume = infy.groupby(infy['Date'].dt.year).reset_index().Volume.sum()
"Error showed as Date", but Infy_volume should be a multi-index series
with two levels of index - Year and Month
.
Here the date column is the index, so use:
infy.groupby(infy.index.year).Volume.sum().reset_index()
If you want to group by year and month, use:
infy_volume = infy.groupby([infy.index.year, infy.index.month]).Volume.sum()
infy_volume.index = infy_volume.index.rename('Month', level=1)
print(infy_volume)
# infy_volume.reset_index()
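If you would rather see the result as a Year by Month table instead of a two-level series, one option (a sketch reusing the infy_volume computed above; volume_table is just an illustrative name) is to unstack the month level:
#Rows = years, columns = months, values = total volume
volume_table = infy_volume.unstack(level='Month')
print(volume_table)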