Selecting specific date from pandas data-frame - python

From daily stock price data, I want to sample and select the end-of-month price. I am doing this with the following code.
import datetime
from pandas_datareader import data as pdr
import pandas as pd
end = datetime.date.today()
begin=end-pd.DateOffset(365*2)
st=begin.strftime('%Y-%m-%d')
ed=end.strftime('%Y-%m-%d')
data = pdr.get_data_yahoo("AAPL",st,ed)
mon_data=pd.DataFrame(data['Adj Close'].resample('M').apply(lambda x: x[-1])).set_index(data.index)
The line above selects end of the month data and here is the output.
If I want to select penultimate value of the month, I can do it using the following code.
mon_data=pd.DataFrame(data['Adj Close'].resample('M').apply(lambda x: x[-2]))
Here is the output.
However, the index still shows the month-end date. When I choose the penultimate value of the month, I want the index to be 2015-12-30 instead of 2015-12-31.
Please suggest the way forward. I hope my question is clear.
Thanking you in anticipation.
Regards,
Abhishek

I am not sure if there is a way to do it with resample. But you can get what you want using groupby and pd.Grouper (pd.TimeGrouper, the older spelling, is deprecated and no longer available in recent pandas releases).
import datetime
from pandas_datareader import data as pdr
import pandas as pd
end = datetime.date.today()
begin = end - pd.DateOffset(365*2)
st = begin.strftime('%Y-%m-%d')
ed = end.strftime('%Y-%m-%d')
data = pdr.get_data_yahoo("AAPL",st,ed)
# keep the real trading date as a column so it survives the grouping
data['Date'] = data.index
mon_data = (
    data[['Date', 'Adj Close']]
    .groupby(pd.Grouper(freq='M')).nth(-2)
    .set_index('Date')
)
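The grouping idea above can be tried without a network call. The following is a minimal sketch on synthetic business-day data; the helper name nth_from_month_end and the sample values are assumptions made for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the downloaded prices: one value per business day.
idx = pd.bdate_range("2015-10-01", "2015-12-31")
prices = pd.Series(np.arange(len(idx), dtype=float), index=idx, name="Adj Close")


def nth_from_month_end(series, n):
    """Pick the row n positions before the last row of each calendar month.

    The returned Series keeps the real dates of the picked rows as its index.
    Months with too few rows are skipped.
    """
    months = series.index.to_period("M")
    picked = []
    for _, grp in series.groupby(months):
        if len(grp) > n:
            picked.append(grp.iloc[[-1 - n]])
    return pd.concat(picked)


# n=0 gives the last trading day of each month, n=1 the penultimate one.
penultimate = nth_from_month_end(prices, 1)
print(penultimate)
```

Because the rows are selected by position inside each month, the index is the actual trading date, so no calendar-day subtraction is needed afterwards.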

The simplest solution is to take the index of your newly created DataFrame and subtract the number of days you want to go back:
n = 1
mon_data=pd.DataFrame(data['Adj Close'].resample('M').apply(lambda x: x[-1-n]))
mon_data.index = mon_data.index - datetime.timedelta(days=n)
Also, looking at your data, I think you should resample to business month end frequency rather than month end frequency:
.resample('BM')
But even that won't cover everything. For instance, December 29, 2017 is a business month end, but that date doesn't appear in your data (which ends on December 8, 2017). You could add a small fix for that (assuming the original data is sorted by date):
end_of_months = mon_data.index.tolist()
end_of_months[-1] = data.index[-1]
mon_data.index = end_of_months
So, the full code will look like:
n = 1
mon_data=pd.DataFrame(data['Adj Close'].resample('BM').apply(lambda x: x[-1-n]))
end_of_months = mon_data.index.tolist()
end_of_months[-1] = data.index[-1]
mon_data.index = end_of_months
mon_data.index = mon_data.index - datetime.timedelta(days=n)
By the way, your .set_index(data.index) throws an error because data and mon_data have different lengths (mon_data is grouped by month).

Related

How to find the last monday's data only from a dataframe in python?

I have a dataframe that contains 1 year of weekly OHLC data.
What do I need?
I want to list only the last Monday's data of each month. For example, May has 5 weeks, and I want to list the data for the last Monday of May and discard the rest. Here's the code that I tried. I'm able to list the data on a weekly basis, but I got stuck here!
Any help would be appreciated!
import pandas as pd
import yfinance as yf
import datetime
from datetime import date, timedelta
periods=pd.date_range(start='2021-4-30',periods=60,freq='W')
start = periods[0].strftime('%Y-%m-%d')
end = periods[-1].strftime('%Y-%m-%d')
symbol="^NSEI"
df=yf.download(symbol,start,end,interval="1wk",index=periods)
You can use groupby(pd.Grouper()) to group by month and get the latest record.
# reset index to flatten columns
df = df.reset_index()
# copy date column to label last monday of a month
df['last_monday_of_month'] = df['Date']
# groupby month and get latest record
df.groupby(pd.Grouper(freq='M', key='Date')).last().reset_index()
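If the data is daily rather than weekly, one possible variant is to filter for Mondays first and then keep the last one in each month. This is a sketch on synthetic data; the Close column and the date range are assumptions chosen only to make the example self-contained.

```python
import pandas as pd

# Synthetic daily data covering May and June 2021.
days = pd.date_range("2021-05-01", "2021-06-30", freq="D")
df = pd.DataFrame({"Date": days, "Close": range(len(days))})

# dayofweek == 0 means Monday.
mondays = df[df["Date"].dt.dayofweek == 0]

# Group the Mondays by calendar month and keep the last row of each group.
last_mondays = mondays.groupby(mondays["Date"].dt.to_period("M")).tail(1)
print(last_mondays)
```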

How do I calculate time difference in days or months in python3

I've been working on a scraping and EDA project on Python3 using Pandas, BeautifulSoup, and a few other libraries and wanted to do some analysis using the time differences between two dates. I want to determine the number of days (or months or even years if that'll make it easier) between the start dates and end dates, and am stuck. I have two columns (air start date, air end date), with dates in the following format: MM-YYYY (so like 01-2021). I basically wanted to make a third column with the time difference between the end and start dates (so I could use it in later analysis).
# split air_dates column into start and end date
dateList = df["air_dates"].str.split("-", n = 1, expand = True)
df['air_start_date'] = dateList[0]
df['air_end_date'] = dateList[1]
df.drop(columns = ['air_dates'], inplace = True)
df.drop(columns = ['rank'], inplace = True)
# changing dates to numerical notation
df['air_start_date'] = pds.to_datetime(df['air_start_date'])
df['air_start_date'] = df['air_start_date'].dt.date.apply(lambda x: x.strftime('%m-%Y') if pds.notnull(x) else npy.NaN)
df['air_end_date'] = pds.Series(df['air_end_date'])
df['air_end_date'] = pds.to_datetime(df['air_end_date'], errors = 'coerce')
df['air_end_date'] = df['air_end_date'].dt.date.apply(lambda x: x.strftime('%m-%Y') if pds.notnull(x) else npy.NaN)
df.isnull().sum()
df.dropna(subset = ['air_end_date'], inplace = True)
def time_diff(time_series):
    return datetime.datetime.strptime(time_series, '%d')
df['time difference'] = df['air_end_date'].apply(time_diff) - df['air_start_date'].apply(time_diff)
The last four lines are my attempt at getting a time difference, but I got an error saying 'ValueError: unconverted data remains: -2021'. Any help would be greatly appreciated, as this has had me stuck for a good while now. Thank you!
As far as I understand, if you have a start date and time and an end date and time, you can use the datetime module in Python.
Something like this would work:
from datetime import datetime
# variable = datetime(year, month, day, hour, minute, second)
start = datetime(2017,5,8,18,56,40)
end = datetime(2019,6,27,12,30,58)
print( end - start ) # this will print the difference between these 2 dates and times
Hope this answer helps you.
Ok so I figured it out. In my second to last line, I replaced the %d with %m-%Y and now it populates the new column with the number of days between the two dates. I think the format needed to be consistent when running strptime so that's what was causing that error.
Here's a slightly cleaned-up version: subtract the start date from the end date to get a timedelta, then take the days attribute from that.
EX:
import pandas as pd
df = pd.DataFrame({'air_dates': ["Apr 2009 - Jul 2010", "not a date - also not a date"]})
df['air_start_date'] = df['air_dates'].str.split(" - ", expand=True)[0]
df['air_end_date'] = df['air_dates'].str.split(" - ", expand=True)[1]
df['air_start_date'] = pd.to_datetime(df['air_start_date'], errors="coerce")
df['air_end_date'] = pd.to_datetime(df['air_end_date'], errors="coerce")
df['timediff_days'] = (df['air_end_date']-df['air_start_date']).dt.days
For the dummy example, that will give you:
df['timediff_days']
0 456.0
1 NaN
Name: timediff_days, dtype: float64
Regarding the calculation of the difference in months, there are several suggested ways to do it. I'd go with @piRSquared's approach:
df['timediff_months'] = ((df['air_end_date'].dt.year - df['air_start_date'].dt.year) * 12 +
(df['air_end_date'].dt.month - df['air_start_date'].dt.month))
df['timediff_months']
0 15.0
1 NaN
Name: timediff_months, dtype: float64
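Since the question describes the dates as MM-YYYY strings, the same calculation can be sketched with an explicit format passed to pd.to_datetime. The sample values below are made up for illustration.

```python
import pandas as pd

# Made-up sample in the MM-YYYY format described in the question.
df = pd.DataFrame({
    "air_start_date": ["04-2009", "01-2021"],
    "air_end_date": ["07-2010", None],
})

# An explicit format avoids ambiguity; unparseable or missing values become NaT.
start = pd.to_datetime(df["air_start_date"], format="%m-%Y", errors="coerce")
end = pd.to_datetime(df["air_end_date"], format="%m-%Y", errors="coerce")

# Days come from the timedelta; months from the year and month components.
df["timediff_days"] = (end - start).dt.days
df["timediff_months"] = (end.dt.year - start.dt.year) * 12 + (end.dt.month - start.dt.month)
print(df)
```

Rows with a missing end date end up as NaN in both new columns, so they can be dropped or filled later.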

How to find missing dates in an excel file by python

I'm a beginner in Python. I have an Excel file that shows the rainfall amount between 2016-1-1 and 2020-6-30. It has 2 columns: the first is the date, the other is the rainfall. Some dates are missing from the file (the rainfall wasn't estimated on those days). For example, there isn't a row for 2016-05-05 in my file. This is a sample of my Excel file.
Date rainfall (mm)
1/1/2016 10
1/2/2016 5
.
.
.
12/30/2020 0
I want to find the missing dates but my code doesn't work correctly!
import pandas as pd
from datetime import datetime, timedelta
from matplotlib import dates as mpl_dates
from matplotlib.dates import date2num
df=pd.read_excel ('rainfall.xlsx')
a= pd.date_range(start = '2016-01-01', end = '2020-06-30' ).difference(df.index)
print(a)
Here's a beginner-friendly way of doing it.
First, make sure that the Date column in your dataframe really holds dates and not strings or generic objects.
Type (or print) df.info().
The date column should show up as datetime64[ns].
If not, df['Date'] = pd.to_datetime(df['Date'], dayfirst=False) fixes that. (Use dayfirst to tell pandas whether the day or the month comes first in your date strings, because it cannot know. Month first is the default, so it would also work without the argument.)
For the task of finding missing days, there are many ways to solve it. Here's one.
Turn all dates into a series:
all_dates = pd.Series(pd.date_range(start = '2016-01-01', end = '2020-06-30' ))
Then print all dates from that series which are not in your dataframe "Date" column. The ~ sign means "not".
print(all_dates[~all_dates.isin(df['Date'])])
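The same check can be sketched end to end on a small in-memory table instead of the Excel file. The rows below are invented for illustration, and the column names follow the sample in the question.

```python
import pandas as pd

# Invented sample: 2016-01-03 and 2016-01-05 are deliberately left out.
df = pd.DataFrame({
    "Date": ["1/1/2016", "1/2/2016", "1/4/2016", "1/6/2016"],
    "rainfall (mm)": [10, 5, 0, 3],
})

# Convert the month-first strings into real datetimes.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# Every calendar day in the period, minus the days present in the data.
expected = pd.date_range(start="2016-01-01", end="2016-01-06")
missing = expected.difference(pd.DatetimeIndex(df["Date"]))
print(missing)
```

The important detail is that the comparison uses the Date column, not the default integer index of the dataframe.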
Try:
df = pd.read_excel('rainfall.xlsx', usecols=[0])
a = pd.date_range(start = '2016-01-01', end = '2020-06-30').difference([l[0] for l in df.values])
print(a)
The dates in the file must be formatted like 2016/1/1.
To find the missing dates in a list, you can also apply the Conditional Formatting feature in Excel. 4. Click OK > OK, then the positions of the missing dates are highlighted. Note: The last date in the date list will be highlighted.
This trick does not use Python; it is a plain Excel trick.

Am I doing something wrong with the loops?

I am using Python to do some data cleaning. I've used the datetime module to split the date and time, and tried to create another column with just the time.
My script runs, but it just takes the last value of the data frame.
Here is the code:
import datetime
i = 0
for index, row in df.iterrows():
    date = datetime.datetime.strptime(df.iloc[i, 0], "%Y-%m-%dT%H:%M:%SZ")
    df['minutes'] = date.minute
    i = i + 1
This is the dataframe :
Output
df['minutes'] = date.minute reassigns the entire 'minutes' column with the scalar value date.minute from the last iteration.
You don't need a loop here, as in the vast majority of cases when using pandas.
You can use vectorized assignment, just replace 'source_column_name' with the name of the column with the source data.
df['minutes'] = pd.to_datetime(df['source_column_name'], format='%Y-%m-%dT%H:%M:%SZ').dt.minute
It is also most likely that you won't need to specify format as pd.to_datetime is fairly smart.
Quick example:
df = pd.DataFrame({'a': ['2020.1.13', '2019.1.13']})
df['year'] = pd.to_datetime(df['a']).dt.year
print(df)
outputs
a year
0 2020.1.13 2020
1 2019.1.13 2019
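Applied to timestamps in the format from the question, the vectorized approach might look like the sketch below. The column name date_time and the sample values are assumptions for illustration.

```python
import pandas as pd

# Made-up sample in the "%Y-%m-%dT%H:%M:%SZ" format used in the question.
df = pd.DataFrame({"date_time": ["2020-01-13T08:15:30Z", "2020-01-13T09:47:05Z"]})

# Parse the whole column once, then derive the new columns from the parsed values.
parsed = pd.to_datetime(df["date_time"], format="%Y-%m-%dT%H:%M:%SZ")
df["minutes"] = parsed.dt.minute
df["time"] = parsed.dt.strftime("%H:%M:%S")
print(df)
```

Each row gets its own value because the assignment works on the whole parsed column rather than on a scalar from the last loop iteration.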
Seems like you're trying to get the time column from the datetime which is in string format. That's what I understood from your post.
Could you give this a shot?
from datetime import datetime
import pandas as pd
def get_time(date_cell):
    dt = datetime.strptime(date_cell, "%Y-%m-%dT%H:%M:%SZ")
    return datetime.strftime(dt, "%H:%M:%SZ")
df['time'] = df['date_time'].apply(get_time)

Pandas QuarterBegin(): Possible Bug when calculating First of quarter

I would like to calculate the "first of quarter" in a pandas dataframe. However I am running into some problems. My Pandas version is 0.17.1.
import pandas as pd
import datetime as dt
test=pd.Timestamp(dt.datetime(2011,1,20))
test=test.tz_localize('Europe/Rome')
previousquarter=test-pd.tseries.offsets.QuarterBegin()
nextquarter=test+pd.tseries.offsets.QuarterBegin()
My expected results would be previousquarter = (2011,1,1) and nextquarter = (2011,4,1). But what I get is previousquarter = (2010,12,1) and nextquarter = (2011,3,1).
I have also tried it without tz_localize. However, it did not change the result.
Am I doing something wrong here or is this a bug somewhere?
Thanks in advance!
P.S. I know I could correct it by shifting one month, but this seems to be a rather crude workaround.
Yup, looks like a bug: https://github.com/pydata/pandas/issues/8435
There is a better workaround than shifting a month though:
offsets.QuarterBegin(startingMonth=1)
The answer given by Marshall worked fine for me, except for the first day of each year, where it pointed to the first date of the last quarter of the previous year. E.g. for 2018-01-01 I was getting 2017-10-01.
I had to do the following to handle that:
(date + pd.tseries.offsets.DateOffset(days=1)) - pd.tseries.offsets.QuarterBegin(startingMonth=1)
where date is a datetime.datetime object.
Reproduced @Abhi's bug in pandas 1.1.3. In fact, I couldn't get consistent results from QuarterEnd or QuarterBegin for all starting dates within the quarter. Instead, I resorted to going to the end of the prior quarter and adding a day, or to the beginning of the next quarter and subtracting a day. Note the use of QuarterEnd(startingMonth=12).
import pandas as pd
print(" date Quarter Quarter begin Quarter end ")
for yr in range(2020, 2021):
    for mo in range(1,13):
        for dy in range(1,4):
            date = pd.Timestamp(yr, mo, dy)
            if dy == 3:
                date = date + pd.tseries.offsets.MonthEnd()
            qbegin = date + pd.offsets.QuarterEnd(-1, startingMonth=12) + pd.offsets.Day(1)
            qend = date + pd.offsets.QuarterBegin(1, startingMonth=1) - pd.offsets.Day(1)
            print("{} {} {} {}".format(date, date.quarter, qbegin, qend))
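A different route from the offsets discussed in this thread is to go through a quarterly Period, which sidesteps the roll-forward and roll-back rules entirely. This is only a sketch for timezone-naive timestamps and calendar quarters starting in January; the helper name quarter_bounds is hypothetical.

```python
import pandas as pd


def quarter_bounds(ts):
    """Return (first day, last day) of the calendar quarter containing ts."""
    quarter = pd.Timestamp(ts).to_period("Q")
    # end_time points at the last nanosecond of the quarter, so normalize it to midnight.
    return quarter.start_time, quarter.end_time.normalize()


first, last = quarter_bounds("2011-01-20")
print(first, last)

# The start of the following quarter is the start time of the next period.
next_first = (pd.Timestamp("2011-01-20").to_period("Q") + 1).start_time
print(next_first)
```

Dates that fall exactly on a quarter boundary, such as 2018-01-01, map to their own quarter with this approach.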
