Suggestions to convert Quarter results to Yearly results using Pandas - python

So I have the quarterly revenue data for Disney_Plus from Q1 2020 to Q4 2021.
The desired output, Disney_Plus_Revenue, should contain yearly results for 2020 and 2021, plus rows for 2010 to 2019 with None/NaN as the yearly result.
I initially renamed the column Year to Quarter, then inserted a new Year column with 2020 and 2021 values and used groupby('Year').agg({'revenue': ['sum']}).
But when I try to append the 2010 to 2019 yearly revenues to this, it throws an error:
Solution I tried
Disney_plus_Revenue = pd.read_csv("Disney_plus_Revenue.csv")
Disney_plus_Revenue.rename(columns = {'Year':'Quarter'},inplace = True)
Disney_plus_Revenue.insert(0,"Year",["2020","2020","2020","2020","2021","2021","2021","2021"],True)
Disney_plus_Revenue.rename(columns = {'Revenue':'Disney_Plus_Revenue'},inplace = True)
Disney_plus_Revenue = Disney_plus_Revenue.groupby('Year').agg({'Disney_Plus_Revenue': ['sum']})
DS_new = pd.DataFrame(
    np.array([["2010", None], ["2011", None], ["2012", None], ["2013", None], ["2014", None],
              ["2015", None], ["2016", None], ["2017", None], ["2018", None], ["2019", None]]),
    columns=['Year', 'Disney_Plus_Revenue']
).append(Disney_plus_Revenue, ignore_index=True)
Error -

First of all, using .agg with a list of functions will give you a DataFrame with a MultiIndex on the columns.
Since you're using only one aggregation function, you could group like this instead:
agg = Disney_plus_Revenue.groupby("Year")["Revenues"].sum()
This will give you a Series:
Year
2020 2.802
2021 5.200
Name: Revenues, dtype: float64
Then you can create another Series with None values for each year:
indexes = np.arange(2010, 2020)
values = [None for x in indexes]
new_series = pd.Series(data=values, index=indexes, name="Revenues")
And finally, concat them:
pd.concat([new_series, agg])
2010 None
2011 None
2012 None
2013 None
2014 None
2015 None
2016 None
2017 None
2018 None
2019 None
2020 2.802
2021 5.2
Name: Revenues, dtype: object
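As a side note, the 2010-2019 placeholder rows can also be produced in one step with reindex rather than building a second frame and concatenating. A minimal sketch, assuming agg is the grouped revenue Series from above and that its Year index holds integers:
import pandas as pd

# assuming `agg` is the Series produced by the groupby("Year")...sum() above,
# indexed by the integer years 2020 and 2021
yearly = agg.reindex(range(2010, 2022))  # 2010-2019 are filled with NaN
print(yearly)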

Related

Convert daily data to weekly by taking average of the 7 days

I've created the following dataframe from data given at the CDC link.
googledata = pd.read_csv('/content/data_table_for_daily_case_trends__the_united_states.csv', header=2)
# Inspect data
googledata.head()
id  State          Date         New Cases
0   United States  Oct 2 2022   11553
1   United States  Oct 1 2022   8024
2   United States  Sep 30 2022  46383
3   United States  Sep 29 2022  89873
4   United States  Sep 28 2022  63763
After converting the Date column to datetime and trimming the data to the last year with a mask operation, I got the data for the last year:
googledata['Date'] = pd.to_datetime(googledata['Date'])
df = googledata
start_date = '2021-10-1'
end_date = '2022-10-1'
mask = (df['Date'] > start_date) & (df['Date'] <= end_date)
df = df.loc[mask]
But the problem is that the data is daily, and I wish to convert it to weekly; i.e. collapse the 365 rows into 52 rows of weekly data by taking the mean of New Cases over the 7 days in each week.
I tried implementing the method shown in this previous post: link. I don't think I am even applying it correctly, because this code never refers to my dataframe anywhere!
logic = {'New Cases' : 'mean'}
offset = pd.offsets.timedelta(days=-6)
f = pd.read_clipboard(parse_dates=['Date'], index_col=['Date'])
f.resample('W', loffset=offset).apply(logic)
But I am getting the following error:
AttributeError: module 'pandas.tseries.offsets' has no attribute 'timedelta'
If I'm understanding correctly, you want to resample:
df = df.set_index("Date")
df.index = df.index - pd.tseries.frequencies.to_offset("6D")
df = df.resample("W").agg({"New Cases": "mean"}).reset_index()
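A quick way to sanity-check this approach, using synthetic daily data in place of the CDC file (the date range and case counts below are made up):
import numpy as np
import pandas as pd

# synthetic stand-in for the trimmed CDC data: one row per day for a year
rng = pd.date_range("2021-10-02", "2022-10-01", freq="D")
daily = pd.DataFrame({"Date": rng,
                      "New Cases": np.random.default_rng(0).integers(1_000, 100_000, len(rng))})

weekly = daily.set_index("Date")
weekly.index = weekly.index - pd.tseries.frequencies.to_offset("6D")
weekly = weekly.resample("W").agg({"New Cases": "mean"}).reset_index()
print(len(weekly))  # roughly 52-53 rows of weekly mean New Cases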
You can use strftime to convert the date to a year-week number before applying groupby:
df['Week'] = df['Date'].dt.strftime('%Y-%U')
df.groupby('Week')['New Cases'].mean()

Series split column with condition

My pandas series contains year values. They're not formatted consistently. For example,
df['year']
1994-1996
circa 1990
1995-1998
circa 2010
I'd like to grab the year from the string.
df['Year'] = df['Year'].astype(str)
df['Year'] = df['Year'].str[:4]
This doesn't work for rows with circa.
I'd like to handle the rows with circa and grab only the year if it exists.
df['Year']
1994
1990
1995
2010
df['Year_Only'] = df['Year'].str.extract(r'(\d{4})', expand=False)
You can use str.extract and then convert to pd.Int16Dtype:
df['Year'] = df['Year'].str.extract(r'(\d{4})', expand=False).astype(pd.Int16Dtype())
print(df)
# Output
Year
0 1994
1 1990
2 1995
3 2010
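If some rows contain no 4-digit year at all, a variation that goes through pd.to_numeric keeps those rows as <NA> (the "unknown" value below is hypothetical, added only for illustration):
import pandas as pd

# hypothetical sample, including one value with no 4-digit year
df = pd.DataFrame({"Year": ["1994-1996", "circa 1990", "1995-1998", "unknown"]})

extracted = df["Year"].str.extract(r"(\d{4})", expand=False)  # year strings or NaN
df["Year"] = pd.to_numeric(extracted).astype(pd.Int16Dtype())
print(df)
#    Year
# 0  1994
# 1  1990
# 2  1995
# 3  <NA>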

Updating Year in One Column Based on Month of Another Column that accounts for New Year

I have created this DataFrame:
agency coupon vintage Cbal Month CPR year month Month_Predicted_DT
0 FHLG 1.5 2021 70.090310 November 5.418937 2022 11 2022-11-01
1 FHLG 1.5 2021 70.090310 December 5.549916 2022 12 2022-12-01
2 FHLG 1.5 2021 70.090310 January 5.238943 2022 1 2022-01-01
3 FHLG 1.5 2020 52.414637 November 5.514456 2022 11 2022-11-01
4 FHLG 1.5 2020 52.414637 December 5.550490 2022 12 2022-12-01
5 FHLG 1.5 2020 52.414637 January 5.182304 2022 1 2022-01-01
Created from this original df:
agency coupon year Cbal November December January
0 FHLG 1.5 2021 70.090310 5.418937 5.549916 5.238943
1 FHLG 1.5 2020 52.414637 5.514456 5.550490 5.182304
2 FHLG 2.0 2022 44.598755 3.346706 3.715995 3.902644
3 FHLG 2.0 2021 472.209165 5.802857 5.899596 5.627774
4 FHLG 2.0 2020 269.761452 7.090993 7.091404 6.567561
Using this code:
from datetime import date

citi = pd.read_excel("Downloads/CITI_2022_05_22(5_22).xlsx")
#Extracting just the relevant months (M, M+1, M+2)
M = citi.columns[-6]
M_1 = citi.columns[-4]
M_2 = citi.columns[-2]
#Extracting just the relevant columns
cols = ['agency-term','coupon','year','Cbal',M,M_1,M_2]
citi = citi[cols]
#Reshaping the wide month columns to long format: one row per original row and month
citi_new = citi.set_index(cols[0:4]).stack().reset_index()
citi_new.rename(columns={"level_4": "Month", 0 : "CPR", "year" : "vintage"}, inplace = True)
#Building Month_Predicted_DT from the current year and the month name
todays_date = date.today()
current_year = todays_date.year
citi_new['year'] = current_year
citi_new['month'] = pd.to_datetime(citi_new.Month, format="%B").dt.month
citi_new['Month_Predicted_DT'] = pd.to_datetime(citi_new[['year', 'month']].assign(DAY=1))
For reference M is the current month, and M_1 and M_2 are month+1 and month+2.
My main question is that my solution for creating the Month_Predicted_DT column only works if the months in question do not cross into the new year. So if M == November or M == December, then the year in Month_Predicted_DT is wrong for January and/or February. For example, when M is November, Month_Predicted_DT for the January rows should be 2023-01-01, not 2022-01-01. Likewise, if M were December, I would want the rows for Jan. and Feb. to be 2023-01-01 and 2023-02-01, respectively.
I have tried to come up with a workaround using df.iterrows or np.where but just can't really get a working solution.
You could try adding 12 months to any parsed date that would otherwise fall before the current month:
#get first day of the current month
start = pd.Timestamp.today().normalize().replace(day=1)
#convert month column to timestamps
dates = pd.to_datetime(df["Month"]+f"{start.year}", format="%B%Y")
#push the year forward when the parsed date falls before the current month
df["Month_Predicted_DT"] = dates.where(dates>=start,dates+pd.DateOffset(months=12))

Convert Integer or Float to Year?

I am trying to convert a column of type integer to a year. Here is my situation:
Original Column: June 13, 1980 (United States)
I split and sliced it to get
Year Column: 1980
Here, I tried to use:
df['Year'] = pd.to_datetime(df['Year'])
It changed the column so that the year is different from the Original column. For example,
Original Year
1980 1970
2000 1970
2016 1970
I am looking forward to your help. Thank you in advance.
Best Regards,
Tu Le
df['Year'] = df['Original'].astype(str).astype('datetime64[ns]')
print(df)
Prints:
Original Year
0 1980 1980-01-01
1 2000 2000-01-01
2 2016 2016-01-01
If you need datetimes built from the year (which implies month=1 and day=1), pass the format parameter, here %Y for a 4-digit year:
df['Year'] = pd.to_datetime(df['Year'], format='%Y')
print (df)
Original Year
0 1980 1980-01-01
1 2000 2000-01-01
2 2016 2016-01-01
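The 1970 results in the question come from how to_datetime treats bare integers: without a format they are read as offsets from the Unix epoch (nanoseconds by default), not as calendar years. A small sketch illustrating the difference:
import pandas as pd

years = pd.Series([1980, 2000, 2016])

# no format: integers are interpreted as nanoseconds since 1970-01-01
print(pd.to_datetime(years))                # all dates land on 1970-01-01
# format='%Y': integers are parsed as calendar years
print(pd.to_datetime(years, format='%Y'))   # 1980-01-01, 2000-01-01, 2016-01-01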

Python - Get policy year from datetime dataframe

I have a dataframe (df) with a column ('date') in datetime format YYYY-MM-DD. I am trying to create a new column that returns the policy year, which always starts on April 1st, so the policy year for January through March is always the prior calendar year. Some dates are rather old, so setting up individual date ranges for the sample below wouldn't be ideal.
The dataframe would look like this
df['date']
2020-12-10
2021-02-10
2019-03-31
and output should look like this
2020
2020
2018
I now know how to get the year using df['date'].dt.year. However, I am having trouble getting the dataframe to convert each year to the respective policy year so that if df['date'].dt.month >= 4 then df['date'].dt.year, else df['date'].dt.year - 1
I am not quite sure how to set this up exactly. I have been trying to avoid creating multiple columns to hold a bool for month >= 4 and so on. I've gone so far as to set up this, but I get a ValueError stating that the truth value of the Series is ambiguous:
def PolYear(x):
    y = x.dt.month
    if y >= 4:
        x.dt.year
    else:
        x.dt.year - 1

df['Pol_Year'] = PolYear(df['date'])
I wasn't sure if this was the right way to go about it, so I also tried a df.loc approach for >= 4 and < 4, but got an error that the lengths of the key and value are not equal. I definitely think I'm missing something super simple.
I previously had mentioned 'fiscal year', but this is incorrect.
Quang Hoand had the right idea but used the incorrect frequency in the call to to_period(self, freq). For your purposes you want to use the following code:
df.date.dt.to_period('Q-MAR').dt.qyear
This will give you:
0 2021
1 2021
2 2019
Name: date, dtype: int64
Q-MAR defines a fiscal year ending in March.
These values are the correct fiscal years (fiscal years are labeled by the year in which they end, not the year in which they begin [reference]). If you want the output to use the year in which they begin, it's simple:
df.date.dt.to_period('Q-MAR').dt.qyear - 1
Giving you
0 2020
1 2020
2 2018
Name: date, dtype: int64
qyear docs
This is qyear:
df.date.dt.to_period('Q').dt.qyear
Output:
0 2020
1 2021
2 2019
Name: date, dtype: int64
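For completeness, the conditional described in the question can also be written directly with vectorized arithmetic, without to_period (a sketch using the three sample dates):
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2020-12-10", "2021-02-10", "2019-03-31"])})

# subtract 1 from the year for January-March rows, keep it otherwise
df["Pol_Year"] = df["date"].dt.year - (df["date"].dt.month < 4).astype(int)
print(df["Pol_Year"])
# 0    2020
# 1    2020
# 2    2018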
