I am working with some historical data on fiscal transfers in Canada. The downloaded data uses fiscal years, i.e.
Year Quebec Alberta
1980-1981 2000 4000
1981-1982 3000 6000
I am using the pandas library. However, when I try to make any visualizations with either matplotlib or seaborn, I get an error: either 'Year' is not recognized as a numerical value, or 'DataFrame' object has no attribute 'Year'. However, when I change the values in the csv to a single year, i.e.
Year Quebec Alberta
1980 2000 4000
1981 3000 6000
it works perfectly fine. Is there a way for Python to treat fiscal year values like 1980-1981 the same as a normal year? Any advice would be much appreciated.
You can use 2-year periods, but note that when you print the DataFrame you see only the start year, not the end year:
print (df)
Year Quebec Alberta
0 1980 2000 4000
1 1981 3000 6000
df['Year'] = df['Year'].apply(lambda x: pd.Period(x, freq='2A-DEC'))
print (df['Year'])
0 1980
1 1981
Name: Year, dtype: period[2A-DEC]
print (df['Year'].dt.to_timestamp('A', how='s'))
0 1980-12-31
1 1981-12-31
Name: Year, dtype: datetime64[ns]
print (df['Year'].dt.to_timestamp('A', how='e'))
0 1981-12-31 23:59:59.999999999
1 1982-12-31 23:59:59.999999999
Name: Year, dtype: datetime64[ns]
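Note: in newer pandas releases the annual alias 'A' is deprecated in favor of 'Y', so the same period may need a different freq string (an assumption that depends on your pandas version):
# 'Y' replaces 'A' as the yearly frequency alias in recent pandas
df['Year'] = df['Year'].apply(lambda x: pd.Period(x, freq='2Y-DEC'))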
But I think the easiest approach is to create two columns, for the start and the end year:
print (df)
Year Quebec Alberta
0 1980-1981 2000 4000
1 1981-1982 3000 6000
df[['StartYear','EndYear']] = df['Year'].str.split('-', expand=True).astype(int)
print (df)
Year Quebec Alberta StartYear EndYear
0 1980-1981 2000 4000 1980 1981
1 1981-1982 3000 6000 1981 1982
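With the numeric StartYear column, plotting works again. A minimal sketch (assuming the Quebec and Alberta columns from the question and matplotlib):
import matplotlib.pyplot as plt

# use the numeric start year of each fiscal year on the x-axis
plt.plot(df['StartYear'], df['Quebec'], label='Quebec')
plt.plot(df['StartYear'], df['Alberta'], label='Alberta')
plt.xlabel('Fiscal year (start)')
plt.ylabel('Transfer')
plt.legend()
plt.show()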
I have a meteorological data set with daily precipitation values for 120 years. I would like to prepare this in such a way that I have monthly average values for 4 climate periods at the end. Example: Average precipitation January, February, March, ... for period 1981 - 2010, average precipitation January, February, March, ... for period 2011 - 2040 and so on.
The data set looks like this (it is available as a csv file and read in as a pandas dataframe):
year month day lon lat value
0 1981 1 1 0 0 0.522592
1 1981 1 2 0 0 2.692495
2 1981 1 3 0 0 0.556698
3 1981 1 4 0 0 0.000000
4 1981 1 5 0 0 0.000000
... ... ... ... ... ... ...
43824 2100 12 27 0 0 0.000000
43825 2100 12 28 0 0 0.185120
43826 2100 12 29 0 0 10.252080
43827 2100 12 30 0 0 13.389290
43828 2100 12 31 0 0 3.523566
Here is my code so far:
csv_path = r'filepath.csv'
df = pd.read_csv(csv_path, delimiter = ';')
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
years = pd.date_range('1981-01-01', periods = 6, freq = '30YS').strftime('%Y')
labels = [f'{a}-{b}' for a, b in zip(years, years[1:])]
(df.assign(period = pd.cut(df['year'], bins = years.astype(int), labels = labels, right = False))
   .groupby(df[['year', 'month']].dt.to_period('M'))
   .agg({'period': 'first', 'value': 'sum'})
   .groupby('period')['value'].mean())
The best way is probably to write a loop that iterates over all months and the 4 30-year periods, but unfortunately I can't get this to work. Does anyone have any tips?
Expected Output:
Month Average
0 January 20
1 February 21
2 March 19
3 April 18
To get the total value per month and then the average over the 30-year periods, you need a double groupby:
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
years = pd.date_range('1981-01-01', periods=6, freq='30YS').strftime('%Y')
labels = [f'{a}-{b}' for a,b in zip(years, years[1:])]
(df
.assign(period=pd.cut(df['year'], bins=years.astype(int), labels=labels, right=False))
.groupby(df['date'].dt.to_period('M')).agg({'period':'first', 'value': 'sum'})
.groupby('period')['value'].mean()
)
output (NaN for the periods with no data in the sample):
period
1981-2011 3.771785
2011-2041 NaN
2041-2071 NaN
2071-2101 27.350056
2101-2131 NaN
Name: value, dtype: float64
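The expected output has one row per calendar month, though. If you want the average per month within each 30-year period, a hedged extension of the same idea (reusing the years and labels defined above):
monthly = (df
 .assign(period=pd.cut(df['year'], bins=years.astype(int), labels=labels, right=False))
 .groupby(df['date'].dt.to_period('M'))
 .agg({'period': 'first', 'value': 'sum'})
)
# average the monthly totals per (climate period, calendar month)
monthly['month'] = monthly.index.month
print(monthly.groupby(['period', 'month'])['value'].mean())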
older answer
The expected output is not fully clear, but if you want average precipitation per quarter per year:
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
df['quarter'] = df['date'].dt.to_period('Q')
df.groupby('quarter')['value'].mean()
output:
quarter
1981Q1 0.754357
2100Q4 5.470011
Freq: Q-DEC, Name: value, dtype: float64
or per quarter globally:
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
df['quarter'] = df['date'].dt.quarter
df.groupby('quarter')['value'].mean()
output:
quarter
1 0.754357
4 5.470011
Name: value, dtype: float64
NB. you can do the same for other periods. For months use to_period('M') / .dt.month
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
df['period'] = df['date'].dt.to_period('M')
df.groupby('period')['value'].mean()
output:
period
1981-01 0.754357
2100-12 5.470011
Freq: M, Name: value, dtype: float64
I am trying to convert a column with type Integer to Year. Here is my situation:
Original Column: June 13, 1980 (United States)
I split and slice it to get
Year Column: 1980
Here, I tried to use:
df['Year'] = pd.to_datetime(df['Year'])
It changed the column so that the year differs from the Original column. For example:
Original Year
1980 1970
2000 1970
2016 1970
I am looking forward to your help. Thank you in advance.
Cast via string; note that recent pandas versions require an explicit unit such as [ns]:
df['Year'] = df['Original'].astype(str).astype('datetime64[ns]')
print(df)
Prints:
Original Year
0 1980 1980-01-01
1 2000 2000-01-01
2 2016 2016-01-01
If you need datetimes from the years (this also sets month=1 and day=1), add the format parameter, here %Y for a four-digit year:
df['Year'] = pd.to_datetime(df['Original'], format='%Y')
print (df)
Original Year
0 1980 1980-01-01
1 2000 2000-01-01
2 2016 2016-01-01
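For completeness, a hedged sketch that goes straight from the original string to a datetime, skipping the split-and-slice step (raw_date is a hypothetical name for the original column):
# hypothetical column holding strings like 'June 13, 1980 (United States)'
years = df['raw_date'].str.extract(r'(\d{4})', expand=False)
df['Year'] = pd.to_datetime(years, format='%Y')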
I am trying to calculate the rolling mean for 1 year in the below pandas dataframe. 'mean_1year' is calculated using a 1-year window based on month and year.
For example, the month and year of the first row are '05' and '2016', so its 'mean_1year' is the average 'price' from '2016-04' back to '2015-04': (1300+1400+1500)/3 = 1400. Also, while calculating this average, a filter has to be applied on the "type" column. As the "type" of the first row is "A", the rows have to be filtered on type=="A" before averaging '2016-04' back to '2015-04'.
type year month price mean_1year
A 2016 05 1200 1400
A 2016 04 1300
A 2016 01 1400
A 2015 12 1500
Any suggestions would be appreciated. Thanks!
First you need a datetime index in ascending order so you can apply a rolling time period calculation.
df['date'] = pd.to_datetime(df['year'].astype('str')+'-'+df['month'].astype('str'))
df = df.set_index('date')
df = df.sort_index()
Then group by type and apply the rolling mean:
df['mean_1year'] = df.groupby('type')['price'].rolling('365D').mean().reset_index(0,drop=True)
The result is:
type year month price mean_1year
date
2015-12-01 A 2015 12 1500 1500.0
2016-01-01 A 2016 1 1400 1450.0
2016-04-01 A 2016 4 1300 1400.0
2016-05-01 A 2016 5 1200 1350.0
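Note that rolling('365D') includes the current row, which is why 2016-05 shows 1350 rather than the expected 1400. If the current row must be excluded, offset-based windows accept a closed parameter; a minimal sketch of that variant:
# closed='left' drops the current row from each window,
# so 2016-05 averages only 2015-12, 2016-01 and 2016-04
df['mean_1year'] = (df.groupby('type')['price']
                      .rolling('365D', closed='left').mean()
                      .reset_index(0, drop=True))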
"Ordinary" rolling can't be applied, because it:
includes rows starting from the current row, whereas you want
to exclude it,
the range of the window expands into the future,
whereas you want to expand it back.
So I used a different approach, based on loc with suitable date slices.
As a test DataFrame I used:
type year month price
0 A 2016 5 1200
1 A 2016 4 1300
2 A 2016 1 1400
3 A 2015 12 1500
4 B 2016 5 1200
5 B 2016 4 1300
And the code is as follows:
Compute date offsets of 12 months and 1 day:
yearOffs = pd.offsets.DateOffset(months=12)
dayOffs = pd.offsets.DateOffset(days=1)
These will be needed in loc later.
Set the index to a datetime, derived from year and
month columns:
df.set_index(pd.to_datetime(df.year.astype(str)
+ df.month.astype(str), format='%Y%m'), inplace=True)
Define the function to compute means within the current
group:
def myMeans(grp):
    wrk = grp.sort_index()
    return wrk.apply(lambda row: wrk.loc[row.name - yearOffs:
                                         row.name - dayOffs, 'price'].mean(),
                     axis=1)
Compute the means:
means = df.groupby('type').apply(myMeans).swaplevel()
So far the result is:
type
2015-12-01 A NaN
2016-01-01 A 1500.0
2016-04-01 A 1450.0
2016-05-01 A 1400.0
2016-04-01 B NaN
2016-05-01 B 1300.0
dtype: float64
but df has a single-level index with non-unique values. So, to add the means to df and drop the now-unnecessary index, the last step is:
df = df.set_index('type', append=True).assign(mean_1year=means)\
.reset_index(level=1).reset_index(drop=True)
The final result is:
type year month price mean_1year
0 A 2016 5 1200 1400.0
1 A 2016 4 1300 1450.0
2 A 2016 1 1400 1500.0
3 A 2015 12 1500 NaN
4 B 2016 5 1200 1300.0
5 B 2016 4 1300 NaN
For the "earliest" rows in each group the result is NaN,
as there are no source (earlier) rows to compute the means
for them (so there is apparently something wrong in the other solution).
I have the following code that summarizes revenue by year and the number of new customers acquired:
min_dates = df.groupby(['Customer ID'])['Date'].min()
df['First Purchase Date'] = df.apply(lambda row: min_dates.loc[row['Customer ID']], axis=1)
df['New Customer'] = df['Date'] <= df['First Purchase Date']
#summarize data by year and total revenue.
df['revenue'] = pd.to_numeric(df['revenue'])
df['Year'] = df['Date'].dt.year
number_new_customers = df.groupby(df['Year'])['New Customer'].sum()
total_revenue = df.groupby(df['Year'])['revenue'].sum()
print(number_new_customers, total_revenue)
I want the results to summarize revenue and number of customers in the same row, but the data is currently being presented as two separate summaries. Any ideas on how I can combine these?
OUTPUT
>>> print(number_new_customers, total_revenue)
Year
2014 135
2015 458
2016 146
2017 174
2018 121
2019 33
Name: New Customer, dtype: float64 Year
2014 342.74
2015 651,227.71
2016 3251.26
2017 232396.94
2018 230,087.80
2019 2342.52
Name: revenue, dtype: float64
>>>
I mocked up some data real fast; the dates are in epoch time.
Date,Customer ID,First Purchase Date,revenue
1339088342,4112,902531671,5.91
1539868917,3145,1256488196,9.65
1356452277,3513,1139574500,3.33
1202597342,1915,1334993455,3.90
1141307061,350,1519053454,8.43
1008402214,2096,980405444,6.83
1409224498,2606,1413988517,8.74
1372252082,3195,1127686807,3.27
1021345993,1141,1487182332,2.44
1597193200,2765,656097735,6.40
Use the below to convert the date columns to datetime:
df['Date'] = pd.to_datetime(df['Date'], unit='s')
df['First Purchase Date'] = pd.to_datetime(df['First Purchase Date'], unit='s')
Using your code to set up your columns, and the groupby below, you can get the output provided.
print(df.groupby(["Year"])[["New Customer", "revenue"]].sum())
Output:
New Customer revenue
Year
2000 4.0 13.56
2001 2.0 14.96
2002 11.0 54.23
2003 1.0 7.13
2004 6.0 31.55
2005 5.0 20.49
2006 7.0 41.29
2007 3.0 25.11
2008 6.0 23.51
2009 1.0 4.76
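Alternatively, named aggregation gives the same one-row-per-year summary with explicit column names (a sketch; assumes pandas 0.25 or newer):
summary = df.groupby('Year').agg(
    new_customers=('New Customer', 'sum'),
    total_revenue=('revenue', 'sum'),
)
print(summary)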
Hope this helps.
I have a dataframe that looks like:
id email domain created_at company
0 1 son#mail.com old.com 2017-01-21 18:19:00 company_a
1 2 boy#mail.com new.com 2017-01-22 01:19:00 company_b
2 3 girl#mail.com nadda.com 2017-01-22 01:19:00 no_company
I need to summarize the data by year, month, and whether the company column has a value other than "no_company":
Desired output:
year month company count
2017 1 has_company 2
no_company 1
The following works great but gives me the count for each value in the company column:
new_df = test_df['created_at'].groupby([test_df.created_at.dt.year, test_df.created_at.dt.month, test_df.company]).agg('count')
print(new_df)
result:
year month company
2017 1 company_a 1
company_b 1
no_company 1
Map a new series for has_company/no_company, then groupby:
c = df.company.map(lambda x: x if x == 'no_company' else 'has_company')
y = df.created_at.dt.year.rename('year')
m = df.created_at.dt.month.rename('month')
df.groupby([y, m, c]).size()
year month company
2017 1 has_company 2
no_company 1
dtype: int64
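If you need a flat table like the desired output, a small follow-up (the column name count is an assumption to match the question):
df.groupby([y, m, c]).size().reset_index(name='count')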