Getting the last value for each week, with the matching date - python

So I start out with a pd.Series called jpm, and I would like to group it into weeks and take the last value from each week. The code below works in that it does get the last value, but it changes the corresponding index to the Sunday of the week, and I would like it to leave the index unchanged.
import pandas_datareader.data as web
import pandas as pd
from datetime import datetime  # pd.datetime was removed in newer pandas
start = datetime(2015, 11, 1)
end = datetime(2015, 11, 17)
raw_jpm = web.DataReader("JPM", 'yahoo', start, end)["Adj Close"]
jpm = raw_jpm.loc[raw_jpm.index[::2]]  # .ix is gone; .loc does the same here
jpm is now
Date
2015-11-02 64.125610
2015-11-04 64.428918
2015-11-06 66.982593
2015-11-10 66.219427
2015-11-12 64.575682
2015-11-16 65.074678
Name: Adj Close, dtype: float64
I want to do some operations to it, such as
weekly = jpm.groupby(pd.Grouper(freq='W')).last()  # pd.TimeGrouper in older pandas
weekly is now
Date
2015-11-08 66.982593
2015-11-15 64.575682
2015-11-22 65.074678
Freq: W-SUN, Name: Adj Close, dtype: float64
which is great, except that all my dates got changed. The output I want is:
Date
2015-11-06 66.982593
2015-11-12 64.575682
2015-11-16 65.074678

you can do it this way:
In [15]: jpm
Out[15]:
Date
2015-11-02 64.125610
2015-11-04 64.428918
2015-11-06 66.982593
2015-11-10 66.219427
2015-11-12 64.575682
2015-11-16 65.074678
Name: Adj Close, dtype: float64
In [16]: jpm.groupby(jpm.index.week).transform('last').drop_duplicates(keep='last')
Out[16]:
Date
2015-11-06 66.982593
2015-11-12 64.575682
2015-11-16 65.074678
dtype: float64
Explanation:
In [17]: jpm.groupby(jpm.index.week).transform('last')
Out[17]:
Date
2015-11-02 66.982593
2015-11-04 66.982593
2015-11-06 66.982593
2015-11-10 64.575682
2015-11-12 64.575682
2015-11-16 65.074678
dtype: float64
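On newer pandas, where DatetimeIndex.week has been removed, the same idea can be sketched with isocalendar(), deduplicating on ISO (year, week) so that two weeks that merely share a last value are not merged (iso and mask are names introduced here):
iso = jpm.index.isocalendar()   # DataFrame with year / week / day columns
mask = ~iso.duplicated(subset=['year', 'week'], keep='last')
weekly = jpm[mask.to_numpy()]   # last row of each week, original dates kept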

You could provide a DateOffset by using the Week offset class with weekday=4, which corresponds to the weekly frequency W-FRI (weekday runs Monday = 0 through Sunday = 6):
jpm.groupby(pd.Grouper(freq=pd.offsets.Week(weekday=4))).last().tail(5)
Date
2016-08-19 65.860001
2016-08-26 66.220001
2016-09-02 67.489998
2016-09-09 66.650002
2016-09-16 65.820000
Freq: W-FRI, Name: Adj Close, dtype: float64
If you want the starting date to be the next Monday after the start date, and the ending date to be the previous Sunday before the end date, you could do it this way:
from datetime import datetime, timedelta
start = datetime(2015, 11, 1)
monday = start + timedelta(days=(7 - start.weekday()))
end = datetime(2016, 9, 30)
sunday = end - timedelta(days=end.weekday() + 1)
print(monday)
2015-11-02 00:00:00
print(sunday)
2016-09-25 00:00:00
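The same two endpoints can also be derived with pandas offsets (a sketch: adding Week(weekday=0) rolls forward to the next Monday, and subtracting Week(weekday=6) rolls back to the previous Sunday):
monday = pd.Timestamp(start) + pd.offsets.Week(weekday=0)
sunday = pd.Timestamp(end) - pd.offsets.Week(weekday=6)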
Then, use it as:
jpm = web.DataReader('JPM', 'yahoo', monday, sunday)["Adj Close"]
jpm.groupby(pd.Grouper(freq='7D')).last()  # 7D bins anchor at the first timestamp, i.e. the Monday
To get every label on a Sunday, since you specified the range Monday → Sunday with Sunday as the last day to be considered, you could use a small hack:
monday_new = monday - timedelta(days=3)
jpm = web.DataReader('JPM', 'yahoo', monday_new, sunday)["Adj Close"]
jpm.groupby(pd.Grouper(freq='W')).last().head()
Date
2015-11-01 62.863448
2015-11-08 66.982593
2015-11-15 64.145175
2015-11-22 66.082449
2015-11-29 65.720431
Freq: W-SUN, Name: Adj Close, dtype: float64
Now that you've posted the desired output: you can arrive at the result using the transform method instead of the aggregated last, so that it returns an object indexed the same as the series being grouped.
import numpy as np

df = jpm.groupby(pd.Grouper(freq='W')).transform('last').reset_index(name='Last')
df['counter'] = (df['Last'].shift() != df['Last']).astype(int).cumsum()
df.groupby(['Last', 'counter'])['Date'].apply(lambda x: np.array(x)[-1]) \
  .reset_index().set_index('Date').sort_index()['Last']
Date
2015-11-06 66.982593
2015-11-12 64.575682
2015-11-16 65.074678
Name: Last, dtype: float64
Note: this handles values that repeat on two separate dates, because the counter column bins the two runs separately.
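For instance, with a hypothetical series where two non-adjacent weeks end on the same value, a plain drop_duplicates on the transformed values would lose one of the dates:
s = pd.Series([2.0, 3.0, 2.0],
              index=pd.to_datetime(['2015-11-06', '2015-11-13', '2015-11-20']))
last = s.groupby(pd.Grouper(freq='W')).transform('last')
# last is [2.0, 3.0, 2.0]: drop_duplicates(keep='last') keeps only 2015-11-20,
# whereas the counter splits the two 2.0 runs and keeps 2015-11-06 as well.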

It seems a little tricky to do this in pure pandas, so I used numpy:
import numpy as np
weekly = jpm.groupby(pd.Grouper(freq='W-SUN')).last()
# For each week-end label, searchsorted(side="right") finds the position just
# past the last observation on or before that label; subtracting 1 recovers
# the actual date of that observation.
weekly.index = jpm.index[np.searchsorted(jpm.index, weekly.index, side="right") - 1]
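A pure-pandas sketch of the same mapping, assuming every weekly bucket is non-empty: recover the last actual date in each bucket, then index back into the original series.
idx = jpm.groupby(pd.Grouper(freq='W-SUN')).apply(lambda s: s.index[-1])
weekly = jpm.loc[idx]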


First week of year considering the first day last year

I have the following df:
time_series date sales
store_0090_item_85261507 1/2020 1,0
store_0090_item_85261501 2/2020 0,0
store_0090_item_85261500 3/2020 6,0
Here 'date' is Week/Year.
So I tried the following code:
df['date'] = df['date'].apply(lambda x: datetime.strptime(x + '/0', "%U/%Y/%w"))
But it returns this df:
time_series date sales
store_0090_item_85261507 2020-01-05 1,0
store_0090_item_85261501 2020-01-12 0,0
store_0090_item_85261500 2020-01-19 6,0
But the first day of the first week of 2020 is 2019-12-29, considering Sunday as the first day. How can I get 2019-12-29 as the first day of the first week of 2020, and not 2020-01-05?
From the datetime module's documentation:
%U: Week number of the year (Sunday as the first day of the week) as a zero padded decimal number. All days in a new year preceding the first Sunday are considered to be in week 0.
Edit: my original answer doesn't work for the input 1/2023, and using ISO 8601 date values doesn't work for 1/2021, so I've edited this answer to add a custom function.
Here is a way with a custom function:
import pandas as pd
from datetime import datetime, timedelta
##############################################
# to demonstrate issues with certain dates
print(datetime.strptime('0/2020/0', "%U/%Y/%w")) # 2019-12-29 00:00:00
print(datetime.strptime('1/2020/0', "%U/%Y/%w")) # 2020-01-05 00:00:00
print(datetime.strptime('0/2021/0', "%U/%Y/%w")) # 2020-12-27 00:00:00
print(datetime.strptime('1/2021/0', "%U/%Y/%w")) # 2021-01-03 00:00:00
print(datetime.strptime('0/2023/0', "%U/%Y/%w")) # 2023-01-01 00:00:00
print(datetime.strptime('1/2023/0', "%U/%Y/%w")) # 2023-01-01 00:00:00
#################################################
df = pd.DataFrame({'date':["1/2020", "2/2020", "3/2020", "1/2021", "2/2021", "1/2023", "2/2023"]})
print(df)
def get_first_day(date):
    date0 = datetime.strptime('0/' + date.split('/')[1] + '/0', "%U/%Y/%w")
    date1 = datetime.strptime('1/' + date.split('/')[1] + '/0', "%U/%Y/%w")
    date = datetime.strptime(date + '/0', "%U/%Y/%w")
    return date if date0 == date1 else date - timedelta(weeks=1)

df['new_date'] = df['date'].apply(get_first_day)
print(df)
Input
date
0 1/2020
1 2/2020
2 3/2020
3 1/2021
4 2/2021
5 1/2023
6 2/2023
Output
date new_date
0 1/2020 2019-12-29
1 2/2020 2020-01-05
2 3/2020 2020-01-12
3 1/2021 2020-12-27
4 2/2021 2021-01-03
5 1/2023 2023-01-01
6 2/2023 2023-01-08
You'll want to use the ISO week parsing directives, e.g.:
import pandas as pd
date = pd.Series(["1/2020", "2/2020", "3/2020"])
pd.to_datetime(date+"/1", format="%V/%G/%u")
0 2019-12-30
1 2020-01-06
2 2020-01-13
dtype: datetime64[ns]
you can also shift by one day if the week should start on Sunday:
pd.to_datetime(date+"/1", format="%V/%G/%u") - pd.Timedelta('1d')
0 2019-12-29
1 2020-01-05
2 2020-01-12
dtype: datetime64[ns]

Pandas df.applymap() produces unwanted type conversion of datetime64[ns] to timedelta

When executing applymap on a masked subset of the df's datetime columns, two of the four columns are converted to timedelta. I can't figure out what might be happening; perhaps an error similar to https://github.com/pandas-dev/pandas/issues/18493? But why only two of the four?
print(time_data.dtypes, time_data[nt].dtypes)
time_data[nt] = time_data[nt].applymap(lambda x: x.strftime('%I:%M:%S %p') if pd.notnull(x) else pd.NaT)
time_data['Total Clock Time'] = time_data['Total Clock Time'].apply(lambda x: x.seconds/3600)
print(time_data.dtypes, time_data[nt].dtypes)
Date object
Name object
In AM datetime64[ns]
Out AM datetime64[ns]
In PM datetime64[ns]
Out PM datetime64[ns]
Sick Time datetime64[ns]
Total Clock Time object
dtype: object
In AM datetime64[ns]
Out AM datetime64[ns]
In PM datetime64[ns]
Out PM datetime64[ns]
Sick Time datetime64[ns]
dtype: object
Date object
Name object
In AM object
Out AM object
In PM timedelta64[ns]
Out PM timedelta64[ns]
Sick Time datetime64[ns]
Total Clock Time float64
dtype: object
In AM object
Out AM object
In PM timedelta64[ns]
Out PM timedelta64[ns]
Sick Time datetime64[ns]
dtype: object
the data look like this:
Date Name In AM Out AM \
0 2017-11-06 AUSTIN LEWIS 1900-01-01 06:10:24 1900-01-01 12:03:23
1 2017-11-06 FRED MOORE 1900-01-01 06:58:37 1900-01-01 12:12:11
2 2017-11-06 KERRIE PAUSSA 1900-01-01 11:58:48 1900-01-01 19:39:49
3 2017-11-06 OMAR CUELLAR NaT NaT
4 2017-11-07 AUSTIN LEWIS 1900-01-01 07:07:27 1900-01-01 12:06:43
In PM Out PM Sick Time
0 1900-01-01 12:32:03 1900-01-01 17:31:50 NaT
1 1900-01-01 12:42:53 1900-01-01 17:31:50 NaT
2 NaT NaT NaT
3 1900-01-01 20:00:19 1900-01-01 23:59:41 NaT
4 1900-01-01 12:35:26 1900-01-01 17:33:20 NaT
strftime returns a string, so those columns come back as object dtype. The two columns that ended up as timedelta64 did so because you said to fill the blanks with pd.NaT. Use np.nan instead, like this:
import numpy as np
df[nt].applymap(lambda x: x.strftime('%I:%M:%S %p') if pd.notnull(x) else np.nan)
df[nt].applymap(lambda x: x.strftime('%I:%M:%S %p') if pd.notnull(x) else np.nan).dtypes
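On recent pandas (2.1+), DataFrame.applymap is deprecated in favour of DataFrame.map; the same fix applies (a sketch, assuming nt is the list of time columns):
import numpy as np
time_data[nt] = time_data[nt].map(lambda x: x.strftime('%I:%M:%S %p') if pd.notnull(x) else np.nan)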

Dates from 1900-01-01 are added to my 'Time' after using df['Time'] = pd.to_datetime(phData['Time'], format='%H:%M:%S')

I am a self-taught coder (for around a year, so new). Here is my data:
phData = pd.read_excel('phone call log & duration.xlsx')
called from called to Date Time Duration in (sec)
0 7722078014 7722012013 2017-07-01 10:00:00 303
1 7722078014 7722052018 2017-07-01 10:21:00 502
2 7722078014 7450120521 2017-07-01 10:23:00 56
The dtypes are:
called from int64
called to int64
Date datetime64[ns]
Time object
Duration in (sec) int64
dtype: object
phData['Time'] = pd.to_datetime(phData['Time'], format='%H:%M:%S')
phData.head(2)
called from called to Date Time Duration in (sec)
0 7722078014 7722012013 2017-07-01 1900-01-01 10:00:00 303
1 7722078014 7722052018 2017-07-01 1900-01-01 10:21:00 502
I've managed to change 'Time' to datetime64[ns], but somehow dates have been added. From where, I have no idea. I want to be able to analyse the Date and Time using pandas, exploring calls made between dates and times, frequency, etc. I also want to save it so it will work in Orange3, but Orange3 won't recognise the Time as a time format. I've tried stripping out the 1900-01-01 but get an error saying it can only be done on an object. I think the Time isn't a datetime but a datetime.time, and I'm not sure why this matters. How can I simply have two columns, one Date and another Time, that pandas will recognise for me to mine? I have looked at countless posts; that's where I found pd.to_datetime and the hint that my issue might be datetime.time, but I'm stuck after this.
Pandas doesn't have a Time dtype; you can have either datetime or timedelta dtype.
Option 1: combine Date and Time into a single column (casting to str first, since your Date column is already datetime64):
In [23]: df['TimeStamp'] = pd.to_datetime(df.pop('Date').astype(str) + ' ' + df.pop('Time').astype(str))
In [24]: df
Out[24]:
called from called to Duration in (sec) TimeStamp
0 7722078014 7722012013 303 2017-07-01 10:00:00
1 7722078014 7722052018 502 2017-07-01 10:21:00
2 7722078014 7450120521 56 2017-07-01 10:23:00
Option 2: convert Date to datetime and Time to timedelta dtype:
In [27]: df.Date = pd.to_datetime(df.Date)
In [28]: df.Time = pd.to_timedelta(df.Time)
In [29]: df
Out[29]:
called from called to Date Time Duration in (sec)
0 7722078014 7722012013 2017-07-01 10:00:00 303
1 7722078014 7722052018 2017-07-01 10:21:00 502
2 7722078014 7450120521 2017-07-01 10:23:00 56
In [30]: df.dtypes
Out[30]:
called from int64
called to int64
Date datetime64[ns]
Time timedelta64[ns]
Duration in (sec) int64
dtype: object
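A nice property of option 2's dtypes is that the two columns recombine naturally, since datetime64 plus timedelta64 yields datetime64 (a sketch):
df['TimeStamp'] = df.Date + df.Time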

How can I get all entries in a DateTimeIndexed pandas series that occur in a list of days?

I have a series of hourly data, and a Python list of dates that I'm interested in examining:
>>> hourly
KWH_DTTM
2015-06-20 15:00:00 2138.4
2015-06-20 16:00:00 4284.0
2015-06-20 17:00:00 4168.8
...
2017-06-21 21:00:00 2743.2
2017-06-21 22:00:00 2757.6
2017-06-21 23:00:00 2635.2
Freq: H, Name: KWH, Length: 17577, dtype: float64
>>> days
[datetime.date(2017, 5, 5), datetime.date(2017, 5, 8), datetime.date(2017, 5, 9), datetime.date(2017, 6, 2)]
I am trying to figure out how to select all entries from hourly that land on a day in days (days is about 50 entries long, and dates can be arbitrary). days is currently a list of Python date objects, but I don't care if they're strings, etc.
If I index hourly with days, I get a series that has been resampled to daily intervals:
>>> hourly[days]
KWH_DTTM
2017-05-05 2628.0
2017-05-08 2628.0
2017-05-09 2548.8
2017-06-02 2512.8
Name: KWH, Length: 30, dtype: float64
If I index with a single day, rendered to a string, I get the desired output for that day:
>>> hourly['2017-5-5']
KWH_DTTM
2017-05-05 00:00:00 2505.6
2017-05-05 01:00:00 2563.2
2017-05-05 02:00:00 2505.6
...
2017-05-05 21:00:00 2268.0
2017-05-05 22:00:00 2232.0
2017-05-05 23:00:00 2088.0
Freq: H, Name: KWH, Length: 24, dtype: float64
Is there a way to do this besides looping over my list of days and concatenating the results?
Consider building a boolean Series with Series.apply(), checking each DatetimeIndex value against every element of days via a list comprehension. Then use this boolean Series to filter the hourly series.
# DATA EXAMPLE
import datetime
import numpy as np
import pandas as pd

np.random.seed(45)
hourly = pd.Series(index=pd.date_range(start='2016-09-05 00:00:00',
                                       periods=17577, freq='H'),
                   data=np.random.randn(17577),
                   name='KWH_DTTM')
days = [datetime.date(2017, 5, 5), datetime.date(2017, 5, 8),
        datetime.date(2017, 5, 9), datetime.date(2017, 6, 2)]
# BOOLEAN SERIES
bools = pd.Series(hourly.index.values).apply(lambda x: any(x.date() == d for d in days))
bools.index = hourly.index
# FILTER ORIGINAL SERIES
newhourly = hourly[bools]
print(newhourly.head(10))
# 2017-05-05 00:00:00 -0.238799
# 2017-05-05 01:00:00 -0.263365
# 2017-05-05 02:00:00 -0.249632
# 2017-05-05 03:00:00 0.131630
# 2017-05-05 04:00:00 -1.279383
# 2017-05-05 05:00:00 0.411316
# 2017-05-05 06:00:00 -2.059022
# 2017-05-05 07:00:00 -1.008058
# 2017-05-05 08:00:00 -0.365651
# 2017-05-05 09:00:00 1.515522
# Name: KWH_DTTM, dtype: float64
print(newhourly.tail(10))
# 2017-06-02 14:00:00 0.329567
# 2017-06-02 15:00:00 -0.618604
# 2017-06-02 16:00:00 0.848719
# 2017-06-02 17:00:00 -1.152657
# 2017-06-02 18:00:00 0.269618
# 2017-06-02 19:00:00 -1.806861
# 2017-06-02 20:00:00 -0.188643
# 2017-06-02 21:00:00 0.515790
# 2017-06-02 22:00:00 0.384695
# 2017-06-02 23:00:00 1.115494
# Name: KWH_DTTM, dtype: float64
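A shorter vectorized sketch using numpy, assuming days holds datetime.date objects as above:
import numpy as np
newhourly = hourly[np.isin(hourly.index.date, days)]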
You could convert hourly to a DataFrame, and then use .isin():
df = hourly.reset_index(name='KWH').rename(columns={'index':'hours'})
df = df[df.hours.apply(lambda x: datetime.date(x.year, x.month, x.day)).isin(dates)]
Here's the complete code with random data:
import pandas as pd
import datetime
import random
random_data = [random.randint(1000,2000) for x in range(1,1000)]
hours = [datetime.datetime(random.randint(2014,2016),random.randint(1,12),random.randint(1,28),random.randint(1,23),0) for x in range(1,1000)]
hourly = pd.Series(data=random_data, index=hours)
dates = [datetime.date(random.randint(2014,2016),random.randint(1,12),random.randint(1,28)) for x in range(1,10)]
df = hourly.reset_index(name='KWH').rename(columns={'index':'hours'})
df = df[df.hours.apply(lambda x: datetime.date(x.year, x.month, x.day)).isin(dates)]
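The apply can also be shortened with the .dt accessor (a sketch, assuming the hours column is datetime64):
df = df[df.hours.dt.date.isin(dates)]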

Dataframe not printing properly

I saved a dataframe to CSV, made some changes, and then tried to load it again. For some reason the date column is all mixed up.
Can someone please tell me why this is happening?
Before saving as CSV, my df looked like this:
aapl = web.DataReader("AAPL", "yahoo", start, end)
bbry = web.DataReader("BBRY", "yahoo", start, end)
lulu = web.DataReader("LULU", "yahoo", start, end)
amzn = web.DataReader("AMZN", "yahoo", start, end)
# Below I create a DataFrame consisting of the adjusted closing price of these stocks, first by making a list of these objects and using the join method
stocks = pd.DataFrame({"AAPL": aapl["Adj Close"],
"BBRY": bbry["Adj Close"],
"LULU": lulu["Adj Close"],
"AMZN":amzn["Adj Close"]}, pd.date_range(start, end, freq='BM'))
​
stocks.head()
​
Out[60]:
AAPL AMZN BBRY LULU
2011-11-30 49.987684 192.289993 17.860001 49.700001
2011-12-30 52.969683 173.100006 14.500000 46.660000
2012-01-31 59.702715 194.440002 16.629999 63.130001
2012-02-29 70.945373 179.690002 14.170000 67.019997
2012-03-30 78.414750 202.509995 14.700000 74.730003
In [74]:
stocks.to_csv('A5.csv', encoding='utf-8')
After reading the CSV back in, it now looks like this:
In [81]:
stocks1.head()
Out[81]:
Unnamed: 0 AAPL AMZN BBRY LULU
0 2011-11-30 00:00:00 49.987684 192.289993 17.860001 49.700001
1 2011-12-30 00:00:00 52.969683 173.100006 14.500000 46.660000
2 2012-01-31 00:00:00 59.702715 194.440002 16.629999 63.130001
3 2012-02-29 00:00:00 70.945373 179.690002 14.170000 67.019997
4 2012-03-30 00:00:00 78.414750 202.509995 14.700000 74.730003
Why is it not recognizing the date column as a date?
Thanks for your help.
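For the CSV round trip itself, the usual fix is to tell read_csv which column holds the index and to parse it as dates (a minimal sketch):
stocks1 = pd.read_csv('A5.csv', index_col=0, parse_dates=True)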
That said, I would suggest using an HDF store instead of CSV: it's much faster, it preserves your dtypes, you can conditionally select subsets of your data sets, it supports fast compression, etc.
import pandas as pd
import pandas_datareader.data as web
stocklist = ['AAPL','BBRY','LULU','AMZN']
p = web.DataReader(stocklist, 'yahoo', '2011-11-01', '2012-04-01')
df = p['Adj Close'].resample('M').last()
print(df)
# saving DF to HDF file
store = pd.HDFStore(r'd:/temp/stocks.h5')
store.append('stocks', df, data_columns=True, complib='blosc', complevel=5)
store.close()
Output:
AAPL AMZN BBRY LULU
Date
2011-11-30 49.987684 192.289993 17.860001 49.700001
2011-12-31 52.969683 173.100006 14.500000 46.660000
2012-01-31 59.702715 194.440002 16.629999 63.130001
2012-02-29 70.945373 179.690002 14.170000 67.019997
2012-03-31 78.414750 202.509995 14.700000 74.730003
let's read our data back from the HDF file:
In [9]: store = pd.HDFStore(r'd:/temp/stocks.h5')
In [10]: x = store.select('stocks')
In [11]: x
Out[11]:
AAPL AMZN BBRY LULU
Date
2011-11-30 49.987684 192.289993 17.860001 49.700001
2011-12-31 52.969683 173.100006 14.500000 46.660000
2012-01-31 59.702715 194.440002 16.629999 63.130001
2012-02-29 70.945373 179.690002 14.170000 67.019997
2012-03-31 78.414750 202.509995 14.700000 74.730003
you can select your data conditionally:
In [12]: x = store.select('stocks', where="AAPL >= 50 and AAPL <= 70")
In [13]: x
Out[13]:
AAPL AMZN BBRY LULU
Date
2011-12-31 52.969683 173.100006 14.500000 46.660000
2012-01-31 59.702715 194.440002 16.629999 63.130001
check index dtype:
In [14]: x.index.dtype
Out[14]: dtype('<M8[ns]')
In [15]: x.index.dtype_str
Out[15]: 'datetime64[ns]'
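Date-based queries work against the stored index as well (a sketch):
x = store.select('stocks', where="index >= '2012-01-01'")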
