Calculate the average between the same month in a time series - python

I have a dataset covering 2002-2018 with one value per month, 198 rows in total.
I want to know how to average all the values from the same month (e.g. January/2003 + ... + January/2018).
# pd.datetime was removed from pandas; ISO dates (YYYY-MM-DD) are parsed without a custom parser
df = pd.read_csv('turbidez.csv', parse_dates=['date'], index_col='date')
data = df['x']
data.head()
date
2002-07-31 8.466111
2002-08-31 6.234259
2002-09-30 8.160763
2002-10-31 4.927685
2002-11-30 8.125012
After some searching I found this solution, but I couldn't apply it properly to my data.
Thank you in advance for any assistance.

Use pandas.to_datetime and pandas.Series.dt.month:
# Sample data
date x
0 2002-07-31 8.466111
1 2003-07-31 6.234259
2 2002-09-30 8.160763
3 2003-09-30 4.927685
4 2002-11-30 8.125012
df["date"] = pd.to_datetime(df["date"] )
# the question asks for the average, so aggregate the value column with mean
new_df = df.groupby(df["date"].dt.month)[["x"]].mean()
print(new_df)
Output:
x
date
7 7.350185
9 6.544224
11 8.125012
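The question's series has the dates as its index rather than as a column, so the same grouping can also be written against the index. This is a minimal sketch, assuming a Series shaped like the question's `data`; the values reuse the answer's sample rows, not the real file.

```python
import pandas as pd

# Stand-in for the question's `data`: values indexed by month-end dates
data = pd.Series(
    [8.466111, 6.234259, 8.160763, 4.927685, 8.125012],
    index=pd.to_datetime(
        ["2002-07-31", "2003-07-31", "2002-09-30", "2003-09-30", "2002-11-30"]
    ),
    name="x",
)

# Group by the calendar month (1-12) of the index and average across years
monthly_mean = data.groupby(data.index.month).mean()
print(monthly_mean)
```

The result has one row per calendar month that occurs in the data, regardless of year.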

Related

Pandas - Compute sum of a column as week-wise columns

I have a table like below containing values for multiple IDs:
ID  value  date
1   20     2022-01-01 12:20
2   25     2022-01-04 18:20
1   10     2022-01-04 11:20
1   150    2022-01-06 16:20
2   200    2022-01-08 13:20
3   40     2022-01-04 21:20
1   75     2022-01-09 08:20
I would like to calculate the week-wise sum of values for each ID:
The start date is given (for example, 01-01-2022).
Weeks are defined as ranges running from Saturday 00:00 to the following Friday 23:59 (i.e. Week 1 is from 01-01-2022 00:00 to 07-01-2022 23:59).
ID  Week 1 sum  Week 2 sum  Week 3 sum  ...
1   180         75          --          --
2   25          200         --          --
3   40          --          --          --
pandas provides pd.Grouper, which lets you specify a groupby instruction [1]. In this case, the specification is to "resample" date at a weekly frequency whose weeks end on Fridays [2]. Since you also need to group by ID, add it to the groupby keys as well.
# convert to datetime
df['date'] = pd.to_datetime(df['date'])
# pivot the dataframe
df1 = (
    df.groupby(['ID', pd.Grouper(key='date', freq='W-FRI')])['value'].sum()
    .unstack(fill_value=0)
)
# rename columns
df1.columns = [f"Week {c} sum" for c in range(1, df1.shape[1]+1)]
df1 = df1.reset_index()
[1] What you actually need is a pivot_table result, but groupby + unstack is equivalent and more convenient here.
[2] Because Jan 1, 2022 is a Saturday, the weekly frequency must be anchored on Friday so that each week ends on a Friday.
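To check the approach against the question's table, the sketch below rebuilds the seven sample rows and applies the same groupby. The variable name `weekly` is only illustrative.

```python
import pandas as pd

# The seven rows from the question
df = pd.DataFrame({
    "ID": [1, 2, 1, 1, 2, 3, 1],
    "value": [20, 25, 10, 150, 200, 40, 75],
    "date": ["2022-01-01 12:20", "2022-01-04 18:20", "2022-01-04 11:20",
             "2022-01-06 16:20", "2022-01-08 13:20", "2022-01-04 21:20",
             "2022-01-09 08:20"],
})
df["date"] = pd.to_datetime(df["date"])

# Weekly bins that end on Friday, one row per ID and one column per week
weekly = (
    df.groupby(["ID", pd.Grouper(key="date", freq="W-FRI")])["value"]
    .sum()
    .unstack(fill_value=0)
)
weekly.columns = [f"Week {i} sum" for i in range(1, weekly.shape[1] + 1)]
print(weekly)
```

Only weeks that contain at least one row become columns here, so a week with no data at all would be skipped in the numbering rather than shown as a column of zeros.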
You can compute a week column. If all your data falls within a single year, the week number alone is enough, but that is unlikely in real-world scenarios. If the data spans multiple years, it is wiser to combine the year and the week number:
df['year_weeknumber'] = df['date'].dt.strftime('%Y-%U')
In your case, the dates 2022-01-01 and 2022-01-04 18:20 should both map to 2022-01 under the scenario you described. Note that %U starts weeks on Sunday and labels the days before the year's first Sunday as week 00, so 2022-01-01 itself comes out as 2022-00.
To calculate your pivot table, you can use pandas pivot_table. Example code:
pd.pivot_table(df, values='value', index=['ID'], columns=['year_weeknumber'], aggfunc='sum')
Let's define a formatting helper.
def fmt(row):
    return f"{row.year}-{row.week:02d}"  # We ignore row.day
Now it's easy.
>>> df = pd.DataFrame([dict(id=1, value=20, date="2022-01-01 12:20"),
...                    dict(id=2, value=25, date="2022-01-04 18:20")])
>>> df['date'] = pd.to_datetime(df.date)
>>> df['iso'] = df.date.dt.isocalendar().apply(fmt, axis='columns')
>>> df
   id  value                date      iso
0   1     20 2022-01-01 12:20:00  2021-52
1   2     25 2022-01-04 18:20:00  2022-01
Then just group by the ISO week.
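When the weeks must be counted from an explicit start date, as the question describes, another option is to compute the week number arithmetically instead of relying on calendar or ISO weeks. This is a minimal sketch, assuming Week 1 starts at 2022-01-01 00:00 and every week is a plain seven-day window; the names `start` and `week` are illustrative.

```python
import pandas as pd

# The seven rows from the question
df = pd.DataFrame({
    "ID": [1, 2, 1, 1, 2, 3, 1],
    "value": [20, 25, 10, 150, 200, 40, 75],
    "date": pd.to_datetime([
        "2022-01-01 12:20", "2022-01-04 18:20", "2022-01-04 11:20",
        "2022-01-06 16:20", "2022-01-08 13:20", "2022-01-04 21:20",
        "2022-01-09 08:20",
    ]),
})

start = pd.Timestamp("2022-01-01")  # assumed start of Week 1 (a Saturday)

# Whole days elapsed since the start, split into 7-day windows, numbered from 1
df["week"] = (df["date"] - start).dt.days // 7 + 1

weekly = df.pivot_table(index="ID", columns="week", values="value",
                        aggfunc="sum", fill_value=0)
weekly.columns = [f"Week {w} sum" for w in weekly.columns]
print(weekly)
```

Because the week number comes from the start date, the column labels stay correct even if some week in the middle has no rows.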

Python: how to groupby a pandas dataframe to count by hour and day?

I have a dataframe like the following:
df.head(4)
timestamp user_id category
0 2017-09-23 15:00:00+00:00 A Bar
1 2017-09-14 18:00:00+00:00 B Restaurant
2 2017-09-30 00:00:00+00:00 B Museum
3 2017-09-11 17:00:00+00:00 C Museum
I would like to count, for each hour of each day, the number of visitors in each category, and get a dataframe like the following:
df
year month day hour category count
0 2017 9 11 0 Bar 2
1 2017 9 11 1 Bar 1
2 2017 9 11 2 Bar 0
3 2017 9 11 3 Bar 1
Assuming you want to group by date and hour, you can use the following code if the timestamp column is a datetime column:
# use bracket assignment: attribute assignment (df.year = ...) does not create a column
df['year'] = df.timestamp.dt.year
df['month'] = df.timestamp.dt.month
df['day'] = df.timestamp.dt.day
df['hour'] = df.timestamp.dt.hour
grouped_data = df.groupby(['year','month','day','hour','category']).count()
For getting the count of user_id per hour per category you can use groupby with your datetime:
df.timestamp = pd.to_datetime(df['timestamp'])
df_new = df.groupby([df.timestamp.dt.year,
                     df.timestamp.dt.month,
                     df.timestamp.dt.day,
                     df.timestamp.dt.hour,
                     'category']).count()['user_id']
df_new.index.names = ['year', 'month', 'day', 'hour', 'category']
df_new = df_new.reset_index()
When you have a datetime column in a dataframe, you can use the dt accessor, which gives you access to the individual parts of the datetime, e.g. the year.
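The expected output in the question also contains rows with a count of 0. One way to get explicit zeros, at least for the hours that appear in the data, is to count with size and spread the categories into columns. This is a minimal sketch on the question's four sample rows; the names `ts` and `counts` are illustrative.

```python
import pandas as pd

# The four sample rows from the question
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2017-09-23 15:00:00+00:00", "2017-09-14 18:00:00+00:00",
        "2017-09-30 00:00:00+00:00", "2017-09-11 17:00:00+00:00",
    ]),
    "user_id": ["A", "B", "B", "C"],
    "category": ["Bar", "Restaurant", "Museum", "Museum"],
})

ts = df["timestamp"]
counts = (
    df.groupby([ts.dt.date.rename("date"), ts.dt.hour.rename("hour"), "category"])
    .size()                  # number of rows per (date, hour, category)
    .unstack(fill_value=0)   # one column per category, 0 where there were no visits
)
print(counts)
```

Hours in which nobody visited any category still do not appear as rows; filling those in would need a reindex over the full range of hours.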

How to split DateTime into Year and Month while summing values?

I have a dataframe whose index is a Date column of datetime type, with a value attached to each entry.
The dates are formatted as yyyy-mm-dd, with one row per consecutive day.
Example:
Date: x:
2012-01-01 44
2012-01-02 75
2012-01-03 62
How would I split the Date column into Year and Month columns, using those two as indexes while also summing the values of all the days in a month?
Example of expected output:
Year: Month: x:
2012 1 745
2 402
3 453
...
2013 1 4353
Use Series.dt.year and Series.dt.month, aggregate with GroupBy.sum, and use rename to set the new column names:
df['Date'] = pd.to_datetime(df['Date'])
df1 = df.groupby([df['Date'].dt.year.rename('Year'),
                  df['Date'].dt.month.rename('Month')])['x'].sum().reset_index()
print (df1)
Year Month x
0 2012 1 181
Use groupby and sum:
(df.groupby([df.Date.dt.year.rename('Year'), df.Date.dt.month.rename('Month')])['x']
.sum())
Year Month
2012 1 181
Name: x, dtype: int64
Note that if "Date" isn't a datetime column, convert it first:
df.Date = pd.to_datetime(df.Date, errors='coerce')
To get Year and Month back as regular columns, add reset_index:
(df.groupby([df.Date.dt.year.rename('Year'), df.Date.dt.month.rename('Month')])['x']
.sum()
.reset_index())
Year Month x
0 2012 1 181
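The question states that the dates are the index rather than a column. In that case the same grouping can be sketched directly on the index, without the dt accessor. The first three rows below come from the question; the fourth is made up only to show a second month.

```python
import pandas as pd

s = pd.Series(
    [44, 75, 62, 10],  # last value is illustrative, not from the question
    index=pd.to_datetime(["2012-01-01", "2012-01-02", "2012-01-03", "2012-02-01"]),
    name="x",
)

# A DatetimeIndex exposes .year and .month directly; rename names the index levels
monthly = s.groupby(
    [s.index.year.rename("Year"), s.index.month.rename("Month")]
).sum()
print(monthly)
```

The result is a Series with a (Year, Month) MultiIndex, which matches the layout of the expected output; calling reset_index on it gives the flat three-column form.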

Pandas add n number of new date rows to DataFrame

I want to add a number of months to the end of my dataframe.
What is the best way to append another six (or 12) months to such a dataframe using dates?
0 2013-07-31
1 2013-08-31
2 2013-09-30
3 2013-10-31
4 2013-11-30
Thanks
Edit: I think you might want pd.date_range
df = pd.DataFrame({'date':['2010-01-31', '2010-02-28'], 'x':[1,2]})
df['date'] = pd.to_datetime(df.date)
date x
0 2010-01-31 1
1 2010-02-28 2
Then
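# DataFrame.append and date_range's closed= argument were removed in pandas 2.0: use pd.concat and slice off the start date
new_dates = pd.date_range(start=df.date.iloc[-1], periods=6, freq=pd.offsets.MonthEnd())[1:]
pd.concat([df, pd.DataFrame({'date': new_dates})])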
date x
0 2010-01-31 1.0
1 2010-02-28 2.0
0 2010-03-31 NaN
1 2010-04-30 NaN
2 2010-05-31 NaN
3 2010-06-30 NaN
4 2010-07-31 NaN
After looking into append and other loop-based options, I came up with this:
length = df.shape[0]
add = 12
start = df['month'].iloc[0]
count = int(length + add)
dt = pd.date_range(start, periods=count, freq='M')
This is the dt I get. It gives the proper month-end days.
DatetimeIndex(['2013-07-31', '2013-08-31', '2013-09-30', '2013-10-31',
'2013-11-30', '2013-12-31', '2014-01-31', '2014-02-28',
'2014-03-31', '2014-04-30', '2014-05-31', '2014-06-30'],
dtype='datetime64[ns]', freq='M')
Now I just have to convert the DatetimeIndex back into a dataframe column.
I hope this is good code. Cheers.
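The remaining conversion step can be sketched with reindex, which keeps the existing rows and leaves the new months empty. This assumes the existing dates are consecutive month-ends stored in a column named `month`, as in the snippet above; the `x` column and its values are hypothetical, added only to show how existing data is carried over.

```python
import pandas as pd

df = pd.DataFrame({
    "month": pd.to_datetime(["2013-07-31", "2013-08-31", "2013-09-30",
                             "2013-10-31", "2013-11-30"]),
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],  # hypothetical values
})
add = 6  # number of month-ends to append

# A MonthEnd offset object avoids the string alias, which differs across pandas versions
full = pd.date_range(start=df["month"].iloc[0], periods=len(df) + add,
                     freq=pd.offsets.MonthEnd())

# Reindex on the full range: existing rows are kept, new months get NaN
extended = (
    df.set_index("month")
    .reindex(full)
    .rename_axis("month")
    .reset_index()
)
print(extended)
```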

pd.to_datetime is getting half my dates with flipped day / months

My dataset has dates in the European format, and I'm struggling to get them parsed correctly by pd.to_datetime: for every day < 12, the month and day are swapped.
Is there an easy solution to this?
import pandas as pd
import datetime as dt
df = pd.read_csv(loc,dayfirst=True)
df['Date']=pd.to_datetime(df['Date'])
Is there a way to force datetime to acknowledge that the input is formatted as dd/mm/yy?
Thanks for the help!
Edit, a sample from my dates:
renewal["Date"].head()
Out[235]:
0 31/03/2018
2 30/04/2018
3 28/02/2018
4 30/04/2018
5 31/03/2018
Name: Earliest renewal date, dtype: object
After running the following:
renewal['Date']=pd.to_datetime(renewal['Date'],dayfirst=True)
I get:
Out[241]:
0 2018-03-31 #Correct
2 2018-04-01 #<-- this number is wrong and should be 01-04 instead
3 2018-02-28 #Correct
Add format.
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
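A small sketch of what the explicit format does, using day-first strings like the ones in the question (the value "01/04/2018" is added here as an illustrative ambiguous date). With a format given, pandas does not guess the field order, and errors="coerce" can be used to flag strings that do not match.

```python
import pandas as pd

dates = pd.Series(["31/03/2018", "01/04/2018", "28/02/2018"])

# Day comes first, then month, then a four-digit year
parsed = pd.to_datetime(dates, format="%d/%m/%Y")
print(parsed)

# Strings that do not match the format become NaT instead of raising an error
checked = pd.to_datetime(pd.Series(["31/03/2018", "2018-04-01"]),
                         format="%d/%m/%Y", errors="coerce")
print(checked.isna().tolist())
```

Here "01/04/2018" is read as 1 April rather than 4 January, which is the behaviour the question is after.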
You can control the date construction directly if you define separate columns for 'year', 'month' and 'day', like this:
import pandas as pd
df = pd.DataFrame(
    {'Date': ['01/03/2018', '06/08/2018', '31/03/2018', '30/04/2018']}
)
date_parts = df['Date'].apply(lambda d: pd.Series(int(n) for n in d.split('/')))
date_parts.columns = ['day', 'month', 'year']
df['Date'] = pd.to_datetime(date_parts)
date_parts
# day month year
# 0 1 3 2018
# 1 6 8 2018
# 2 31 3 2018
# 3 30 4 2018
df
# Date
# 0 2018-03-01
# 1 2018-08-06
# 2 2018-03-31
# 3 2018-04-30
