Pandas: monthly date range with number of days - python

Suppose I have start and end dates like so:
start_d = datetime.date(2017, 7, 20)
end_d = datetime.date(2017, 9, 10)
I wish to obtain a Pandas DataFrame that looks like this:
Month NumDays
2017-07 12
2017-08 31
2017-09 10
It shows how many days of each month fall within my range.
So far I can generate the monthly series with pd.date_range(start_d, end_d, freq='MS').

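You can first build a daily date_range (daily is the default frequency), create a Series indexed by it, and resample by month start with size to count the days. Finally, convert the result to monthly periods with to_period:
import datetime as dt
import pandas as pd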
start_d = dt.date(2017, 7, 20)
end_d = dt.date(2017, 9, 10)
s = pd.Series(index=pd.date_range(start_d, end_d), dtype='float64')
df = s.resample('MS').size().rename_axis('Month').reset_index(name='NumDays')
df['Month'] = df['Month'].dt.to_period('m')
print (df)
Month NumDays
0 2017-07 12
1 2017-08 31
2 2017-09 10
Thank you, Zero, for simplifying the solution:
df = s.resample('MS').size().to_period('m').rename_axis('Month').reset_index(name='NumDays')
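The same counts can also be obtained without resampling. The following is a minimal sketch, not taken from the answers above: it counts how often each monthly period occurs among the daily timestamps, and the names days and counts are illustrative only.

```python
import datetime as dt

import pandas as pd

start_d = dt.date(2017, 7, 20)
end_d = dt.date(2017, 9, 10)

# One timestamp per day, with both endpoints included
days = pd.date_range(start_d, end_d)

# Map every day to its monthly period and count the occurrences per month
counts = pd.Series(days.to_period("M")).value_counts().sort_index()

df = counts.rename_axis("Month").reset_index(name="NumDays")
print(df)
```

For the dates in the question this yields 12, 31 and 10 days for 2017-07, 2017-08 and 2017-09.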

Related

How do I adjust the dates of a column in pandas according to a threshold?

I have a data frame with a datetime column like so:
dates
0 2017-09-19
1 2017-08-28
2 2017-07-13
I want to know if there is a way to adjust the dates with this condition:
If the day of the date is before 15, then change the date to the end of last month.
If the day of the date is 15 or after, then change the date to the end of the current month.
My desired output would look something like this:
dates
0 2017-09-30
1 2017-08-31
2 2017-06-30
Using np.where and Josh's suggestion of MonthEnd, this can be simplified a bit.
Given:
dates
0 2017-09-19
1 2017-08-28
2 2017-07-13
Doing:
import numpy as np
from pandas.tseries.offsets import MonthEnd
# Where the day is less than 15,
# give the month end of the previous month.
# Otherwise,
# give the month end of the current month.
df.dates = np.where(df.dates.dt.day.lt(15),
                    df.dates.add(MonthEnd(-1)),
                    df.dates.add(MonthEnd(0)))
print(df)
# Output:
dates
0 2017-09-30
1 2017-08-31
2 2017-06-30
Easy with MonthEnd
Let's set up the data:
import pandas as pd
dates = pd.Series({0: '2017-09-19', 1: '2017-08-28', 2: '2017-07-13'})
dates = pd.to_datetime(dates)
Then:
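from pandas.tseries.offsets import MonthEnd
pre, post = dates.dt.day < 15, dates.dt.day >= 15
dates.loc[pre] = dates.loc[pre] + MonthEnd(-1)
# MonthEnd(0) rolls forward to the end of the current month and leaves a date
# that is already a month end unchanged; MonthEnd(1) would push it to the next month.
dates.loc[post] = dates.loc[post] + MonthEnd(0)
Explanation: create the masks (pre and post) first, then use them to move each date to the end of the previous or the current month, as appropriate.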
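Both answers can be folded into one reusable helper. This is only a sketch under stated assumptions: the function name snap_to_month_end and its threshold parameter are hypothetical, and the extra row 2017-08-31 is added here just to show how a date that already sits on a month end is handled.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd


def snap_to_month_end(dates, threshold=15):
    """Move each date to the previous month end if its day is below
    threshold, otherwise to the end of its own month."""
    early = dates.dt.day < threshold
    current_end = dates + MonthEnd(0)    # a month-end date stays where it is
    previous_end = dates + MonthEnd(-1)
    # Keep current_end where the date is not early, else take previous_end
    return current_end.where(~early, previous_end)


dates = pd.to_datetime(
    pd.Series(["2017-09-19", "2017-08-28", "2017-07-13", "2017-08-31"])
)
adjusted = snap_to_month_end(dates)
print(adjusted)
```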

Separation of year, month and day in an 8-digit number

I am new to Python/pandas.
I have a data frame that looks like this
x = pd.DataFrame([20210901,20210902, 20210903, 20210904])
[out]:
0
0 20210901
1 20210902
2 20210903
3 20210904
I want to separate each row as follows: For example
year = 2021
month = 9
day = 1
or I have a list for each row like this:
[2021,9,1]
You can use pd.to_datetime to convert the entire column to datetime type.
>>> import pandas as pd
>>>
>>> df = pd.DataFrame({'Col': [20210917,20210918, 20210919, 20210920]})
>>>
>>> df.Col = pd.to_datetime(df.Col, format='%Y%m%d')
>>> df
Col
0 2021-09-17
1 2021-09-18
2 2021-09-19
3 2021-09-20
>>> df['Year'] = df.Col.dt.year
>>> df['Month'] = df.Col.dt.month
>>> df['Day'] = df.Col.dt.day
>>>
>>> df
Col Year Month Day
0 2021-09-17 2021 9 17
1 2021-09-18 2021 9 18
2 2021-09-19 2021 9 19
3 2021-09-20 2021 9 20
If you want the result as a list, you can use a list comprehension together with the zip function.
>>> [(year, month, day) for year, month, day in zip(df.Year, df.Month, df.Day)]
[(2021, 9, 17), (2021, 9, 18), (2021, 9, 19), (2021, 9, 20)]
Split each row into three new columns: year, month, and day.
import pandas as pd
df = pd.DataFrame({"date":[20210901,20210902, 20210903, 20210904]})
# split date field to three fields: year month day
df["year"] = df["date"].apply(lambda x: str(x)[:4])
df["month"] = df["date"].apply(lambda x: str(x)[4:6])
df["day"] = df["date"].apply(lambda x: str(x)[6:])
print(df)
The result is as follows:
date year month day
0 20210901 2021 09 01
1 20210902 2021 09 02
2 20210903 2021 09 03
3 20210904 2021 09 04
I created a simple function that returns a zipped list in the format you wanted:
def convert_to_dates(df):
    # Collect the value of the first column from every row
    dates = []
    for i, v in df.iterrows():
        dates.append(v.values[0])
    # Convert each 8-digit number to a string so it can be sliced
    for i in range(len(dates)):
        dates[i] = str(dates[i])
    years = []
    months = []
    days = []
    for i in range(len(dates)):
        years.append(dates[i][0:4])   # YYYY
        months.append(dates[i][4:6])  # MM
        days.append(dates[i][6:8])    # DD
    return list(zip(years, months, days))
Call it by using convert_to_dates(x)
Output:
In [4]: convert_to_dates(x)
Out[4]:
[('2021', '09', '01'),
('2021', '09', '02'),
('2021', '09', '03'),
('2021', '09', '04')]
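If integer parts such as [2021, 9, 1] are wanted, as in the question, the digits can also be split arithmetically instead of via strings or datetimes. The following is a minimal sketch that assumes every value is a valid 8-digit YYYYMMDD integer; the names col, parts and rows are illustrative.

```python
import pandas as pd

x = pd.DataFrame([20210901, 20210902, 20210903, 20210904])
col = x[0]

parts = pd.DataFrame({
    "year": col // 10000,       # drop the last four digits (MMDD)
    "month": col // 100 % 100,  # drop DD, then keep the last two digits
    "day": col % 100,           # keep the last two digits
})

rows = parts.values.tolist()
print(rows)
```

Unlike pd.to_datetime, this does no validation, so a value such as 20211345 would pass through unnoticed.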

Is there a better way to determine day in future using integer values

Here is the problem:
If the int values [0,7) (0, 1, 2, 3, 4, 5, 6) refer to Monday through
Sunday, and today is Monday, what day of the week will it be in 999
days?
Here is how I solved it:
import datetime
#Capture the First Date
day1 = datetime.date(2021, 1, 25)
print('day1:', day1.ctime())
# Capture the Second Date
day2 = datetime.date(2023, 10, 21)
print('day2:', day2.ctime())
# Find the difference between the dates
print('Number of Days:', day1-day2)
Returns:
day1: Mon Jan 25 00:00:00 2021
day2: Sat Oct 21 00:00:00 2023
Number of Days: -999 days, 0:00:00
Use timedelta to add n days to your "start date":
from datetime import date, timedelta
current = date.today()
future = current + timedelta(days=999)
print(f"{current=}", current.weekday(), current.strftime("%A"))
print(f"{future=}", future.weekday(), future.strftime("%A"))
Output:
current=datetime.date(2021, 1, 24) 6 Sunday
future=datetime.date(2023, 10, 20) 4 Friday
.weekday() returns the day of the week as an integer.
.strftime("%A") will format a date object as a Weekday name.
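Since the problem is stated purely in terms of the integers 0 to 6, it can also be answered without any calendar dates by using modular arithmetic. This is a small sketch; the helper weekday_after and the DAY_NAMES list are hypothetical names introduced for illustration.

```python
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def weekday_after(start, days_ahead):
    """Return the weekday index (0 = Monday) that falls days_ahead days after start."""
    # Weekdays repeat every 7 days, so only the remainder matters
    return (start + days_ahead) % 7


index = weekday_after(0, 999)
print(index, DAY_NAMES[index])  # 5 Saturday
```

This agrees with the dates in the question: 999 days after Monday 2021-01-25 is Saturday 2023-10-21.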

Pandas sum by date indexed but exclude totals column

I have a dataframe that is being read from database records and looks like this:
date added total
2020-09-14 5 5
2020-09-15 4 9
2020-09-16 2 11
I need to be able to resample by different periods and this is what I am using:
df = pd.DataFrame.from_records(raw_data, index='date')
df.index = pd.to_datetime(df.index)
# let's say I want yearly sample, then I would do
df = df.fillna(0).resample('Y').sum()
This almost works, but it is obviously summing the total column, which is something I don't want. I need total column to be the value in the date sampled in the dataframe, like this:
# What I need
date added total
2020 11 11
# What I'm getting
date added total
2020 11 25
You can do this by aggregating each column differently. Here you want the sum() aggregator for the added column, but max() for total; max() gives the value at the end of each period as long as total never decreases within that period.
df = pd.DataFrame({'date': [20200914, 20200915, 20200916, 20210101, 20210102],
                   'added': [5, 4, 2, 1, 6],
                   'total': [5, 9, 11, 1, 7]})
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
df_res = df.resample('Y', on='date').agg({'added':'sum', 'total':'max'})
And the result is:
df_res
added total
date
2020-12-31 11 11
2021-12-31 7 7
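If total is not guaranteed to be non-decreasing, taking the last value of each period is closer to what the question describes. Below is a sketch under that assumption; it groups by calendar year instead of calling resample, and the names ordered and yearly are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2020-09-14", "2020-09-15", "2020-09-16",
                            "2021-01-01", "2021-01-02"]),
    "added": [5, 4, 2, 1, 6],
    "total": [5, 9, 11, 1, 7],
})

# 'last' depends on row order, so sort chronologically first
ordered = df.sort_values("date")

yearly = ordered.groupby(ordered["date"].dt.year.rename("year")).agg(
    added=("added", "sum"),   # sum the increments within the year
    total=("total", "last"),  # keep the running total as of the last date
)
print(yearly)
```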

Cumulative sum over days in python

I have the following dataframe:
date money
0 2018-01-01 20
1 2018-01-05 30
2 2018-02-15 7
3 2019-03-17 150
4 2018-01-05 15
...
2530 2019-03-17 350
And I need:
[(2018-01-01,20),(2018-01-05,65),(2018-02-15,72),...,(2019-03-17,572)]
So I need a cumulative sum of money over all days.
So far I have tried many things, and the closest I think I've got is:
graph_df.date = pd.to_datetime(graph_df.date)
temporary = graph_df.groupby('date').money.sum()
temporary = temporary.groupby(temporary.index.to_period('date')).cumsum().reset_index()
But this gives me ValueError: Invalid frequency: date
Could anyone help please?
Thanks
I don't think you need the second groupby. You can simply add a column with the cumulative sum.
This does the trick for me:
import pandas as pd
df = pd.DataFrame({'date': ['01-01-2019','04-06-2019', '07-06-2019'], 'money': [12,15,19]})
df['date'] = pd.to_datetime(df['date']) # this is not strictly needed
tmp = df.groupby('date')['money'].sum().reset_index()
tmp['money_sum'] = tmp['money'].cumsum()
Converting the date column to an actual date is not needed for this to work.
list(map(tuple, df.groupby('date', as_index=False)['money'].sum().values))
Edit:
df = pd.DataFrame({'date': ['2018-01-01', '2018-01-05', '2018-02-15', '2019-03-17', '2018-01-05'],
                   'money': [20, 30, 7, 150, 15]})
#df['date'] = pd.to_datetime(df['date'])
#df = df.sort_values(by='date')
temporary = df.groupby('date', as_index=False)['money'].sum()
temporary['money_cum'] = temporary['money'].cumsum()
Result:
>>> list(map(tuple, temporary[['date', 'money_cum']].values))
[('2018-01-01', 20),
('2018-01-05', 65),
('2018-02-15', 72),
('2019-03-17', 222)]
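You can also take a running total with cumsum and keep the last row of each date group. Note that this accumulates in row order, so sort by date first if the rows are not chronological: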
example data frame:
df
date money
0 01/01/2018 20
1 05/01/2018 30
2 15/02/2018 7
3 17/03/2019 150
4 05/01/2018 15
5 17/03/2019 550
6 15/02/2018 13
df['cumsum'] = df.money.cumsum()
list(zip(df.groupby('date').tail(1)['date'], df.groupby('date').tail(1)['cumsum']))
[('01/01/2018', 20),
('05/01/2018', 222),
('17/03/2019', 772),
('15/02/2018', 785)]
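To get a total that accumulates in chronological order, the dates can be parsed first so that groupby sorts them as dates rather than as strings. The following is a sketch using the example frame above, assuming the dates are in day/month/year format; the names daily, running and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["01/01/2018", "05/01/2018", "15/02/2018", "17/03/2019",
             "05/01/2018", "17/03/2019", "15/02/2018"],
    "money": [20, 30, 7, 150, 15, 550, 13],
})

# Parse day/month/year strings so the dates sort chronologically
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y")

daily = df.groupby("date")["money"].sum()  # one total per day, sorted by date
running = daily.cumsum()                   # cumulative sum across days

result = [(d.strftime("%Y-%m-%d"), int(v)) for d, v in running.items()]
print(result)
```

Here the daily totals are 20, 45, 20 and 700, so the running totals come out as 20, 65, 85 and 785.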
