For loop is only storing the last value in column - python

I am trying to pull the week number given a date and then add that week number to the corresponding row in a pandas/python dataframe.
When I run a for loop it is only storing the last calculated value instead of recording each value.
I've tried .append but haven't been able to get anything to work.
import datetime
from datetime import date
for i in df.index:
    week_number = date(df.year[i], df.month[i], df.day[i]).isocalendar()
    df['week'] = (week_number[1])
Expected values:
day  month  year  week
8    12     2021  49
19   12     2021  50
26   12     2021  51
Values I'm getting:
day  month  year  week
8    12     2021  51
19   12     2021  51
26   12     2021  51

You can simply use the pandas .apply method to make it a one-liner:
df["week"] = df.apply(lambda x: date(x.year, x.month, x.day).isocalendar()[1], axis=1)

You need to assign the value back at the corresponding position i. Inside the loop, df['week'] = week_number[1] assigns that single value to the entire column on every iteration, which is why only the last one survives. Writing to row i with .loc fixes it:
for i in df.index:
    week_number = date(df.year[i], df.month[i], df.day[i]).isocalendar()
    df.loc[i, 'week'] = week_number[1]
Printing the result:
print(df)
day month year week
0 8 12 2021 49
1 19 12 2021 50
2 26 12 2021 51

You can also do this without apply, which is slow because it works row by row:
df['week'] = pd.to_datetime(df['year'].astype(str) + '-' + df['month'].astype(str) + '-' + df['day'].astype(str)).dt.isocalendar().week
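A small variant sketch (not from the answer above): pd.to_datetime also accepts a DataFrame whose columns are named year, month and day, which avoids the string concatenation:
df['week'] = pd.to_datetime(df[['year', 'month', 'day']]).dt.isocalendar().week
# .dt.isocalendar().week comes back as a nullable UInt32 column;
# add .astype(int) if a plain integer dtype is preferred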

Related

How to calculate monthly and weekly averages from a dataframe using python?

Below is my dataframe. How do I calculate both monthly and weekly averages from it in Python? I need to print the month start and end dates with the monthly average, and the week start and end dates with the weekly average.
**Input (sample dataset)**
kpi_id kpi_name run_date value
1 MTTR 5/17/2021 15
2 MTTR 5/18/2021 16
3 MTTR 5/19/2021 17
4 MTTR 5/20/2021 18
5 MTTR 5/21/2021 19
6 MTTR 5/22/2021 20
7 MTTR 5/23/2021 21
8 MTTR 5/24/2021 22
9 MTTR 5/25/2021 23
10 MTTR 5/26/2021 24
11 MTTR 5/27/2021 25
**expected output**
**monthly_mean**
kpi_name month_start month_end value(mean)
MTTR 5/1/2021 5/31/2021 20
**weekly_mean**
kpi_name week_start week_end value(mean)
MTTR 5/17/2021 5/23/2021 18
MTTR 5/24/2021 5/30/2021 23.5
groupby is your friend
monthly = df.groupby(pd.Grouper(key='run_date', freq='M')).mean()
weekly = df.groupby(pd.Grouper(key='run_date', freq='W')).mean()
Extending the answer from igrolvr to match your expected output:
# transformation from string to datetime for groupby
df['run_date'] = pd.to_datetime(df['run_date'], format='%m/%d/%Y')
# monthly
# The Grouper with freq='M' labels each group with the last date of the month;
# reset_index() turns the group-by keys back into columns
monthly = df.groupby(by=['kpi_name', pd.Grouper(key='run_date', freq='M')])['value'].mean().reset_index()
# Getting the start of month
monthly['month_start'] = monthly['run_date'] - pd.offsets.MonthBegin(1)
# Renaming the run_date column to month_end
monthly = monthly.rename({'run_date': 'month_end'}, axis=1)
print(monthly)
# weekly
# The Grouper with freq='W' labels each group with the last date of the week
# (a Sunday under the default W-SUN); reset_index() turns the group-by keys back into columns
weekly = df.groupby(by=['kpi_name', pd.Grouper(key='run_date', freq='W')])['value'].mean().reset_index()
# Getting the start of the week: step back one full week from the Sunday label,
# then add one day to land on the Monday
weekly['week_start'] = weekly['run_date'] - pd.offsets.Week(1) + pd.offsets.Day(1)
# Renaming the run_date column to week_end
weekly = weekly.rename({'run_date': 'week_end'}, axis=1)
print(weekly)
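A variant sketch, not part of the answer above: pandas Periods expose start_time and end_time, so the week boundaries can also be read off directly instead of doing offset arithmetic. This assumes run_date was already converted to datetime as in the first step above; weekly periods run Monday-Sunday under the default W-SUN anchor:
wk = df['run_date'].dt.to_period('W')
weekly_alt = (df.assign(week_start=wk.dt.start_time,
                        week_end=wk.dt.end_time.dt.normalize())
                .groupby(['kpi_name', 'week_start', 'week_end'])['value']
                .mean()
                .reset_index())
print(weekly_alt)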

Python: Pandas dataframe get the year to which the week number belongs and not the year of the date

I have a csv-file: https://data.rivm.nl/covid-19/COVID-19_aantallen_gemeente_per_dag.csv
I want to use it to provide insight into the corona deaths per week.
df = pd.read_csv("covid.csv", error_bad_lines=False, sep=";")
df = df.loc[df['Deceased'] > 0]
df["Date_of_publication"] = pd.to_datetime(df["Date_of_publication"])
df["Week"] = df["Date_of_publication"].dt.isocalendar().week
df["Year"] = df["Date_of_publication"].dt.year
df = df[["Week", "Year", "Municipality_name", "Deceased"]]
df = df.groupby(by=["Week", "Year", "Municipality_name"]).agg({"Deceased" : "sum"})
df = df.sort_values(by=["Year", "Week"])
print(df)
Everything seems to be working fine except for the first 3 days of 2021. The first 3 days of 2021 are part of the last week (53) of 2020: http://week-number.net/calendar-with-week-numbers-2021.html.
When I print the dataframe this is the result:
53 2021 Winterswijk 1
Woudenberg 1
Zaanstad 1
Zeist 2
Zutphen 1
So basically what I'm looking for is a way where this line returns the year of the week number and not the year of the date:
df["Year"] = df["Date_of_publication"].dt.year
You can use dt.isocalendar().year to set up df["Year"]:
df["Year"] = df["Date_of_publication"].dt.isocalendar().year
This gives year 2020 for 2021-01-01 and switches back to 2021 from 2021-01-04 onwards.
It mirrors how you already use dt.isocalendar().week to set up df["Week"]. Since both values come from the same (year, week, day) tuple returned by dt.isocalendar(), they always stay in sync.
Demo
date_s = pd.Series(pd.date_range(start='2021-01-01', periods=5, freq='1D'))
date_s
0 2021-01-01
1 2021-01-02
2 2021-01-03
3 2021-01-04
4 2021-01-05
date_s.dt.isocalendar()
year week day
0 2020 53 5
1 2020 53 6
2 2020 53 7
3 2021 1 1
4 2021 1 2
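As a follow-up sketch (the same idea, just written in one pass): .dt.isocalendar() returns year, week and day together, so both columns can be taken from a single call:
iso = df["Date_of_publication"].dt.isocalendar()  # DataFrame with year, week, day columns
df["Week"] = iso["week"]
df["Year"] = iso["year"]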
You can also simply subtract two dates and divide the days attribute of the resulting timedelta by 7.
For example, this gives the time elapsed in 2021 so far:
import datetime as dt

time_delta = dt.datetime.today() - dt.datetime(2021, 1, 1)
The output is a datetime timedelta object:
datetime.timedelta(days=75, seconds=84904, microseconds=144959)
For your problem, you would do something like this, subtracting the start of each row's year from the publication date and integer-dividing the day count by 7:
df["Week"] = (df["Date_of_publication"] - pd.to_datetime(df["Year"].astype(str), format="%Y")).dt.days // 7
The result is the number of whole weeks elapsed between the start of that year and the publication date, which approximates the week number.

Get a week startdate from week number for entire dateframe in python

I am looking for the week start date for an entire dataframe, with the format dd-mm-yyyy.
Below are the week numbers (src_data['WEEK']):
28
29
30
31
32
33
34
35
Code:
src_data['firstdayofweek'] = datetime.datetime.strptime(f'{2020}-W{int(src_data['WEEK'] )- 1}-1','%Y-W%W-%w').date()
Output:
Thanks in advance
You can concatenate a year and a weekday as strings and parse with to_datetime using the ISO directives %G (ISO year), %V (ISO week) and %u (ISO weekday). If desired, convert back to string with strftime:
src_data = pd.DataFrame({'WEEK':[28,29,30,31,32,33,34,35]})
year, weekday = '2020', '1'
src_data['DATE'] = pd.to_datetime(year + src_data['WEEK'].astype(str) + weekday,
                                  format='%G%V%u').dt.strftime('%d-%m-%Y')
# src_data
# WEEK DATE
# 0 28 06-07-2020
# 1 29 13-07-2020
# 2 30 20-07-2020
# 3 31 27-07-2020
# 4 32 03-08-2020
# 5 33 10-08-2020
# 6 34 17-08-2020
# 7 35 24-08-2020
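A standard-library variant sketch (Python 3.8+), not part of the answer above, in case the %G%V%u directives are hard to remember: date.fromisocalendar builds a date from an ISO year, week and weekday, where weekday 1 is the Monday that starts the week:
from datetime import date

src_data['DATE'] = [date.fromisocalendar(2020, int(w), 1).strftime('%d-%m-%Y')
                    for w in src_data['WEEK']]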

extract number of days in month from date column, return days to date in current month

I have a pandas dataframe with a date column
I'm trying to create a function and apply it to the dataframe to create a column that returns the number of days in the month/year specified
So far I have:
from calendar import monthrange
def dom(x):
    m = dfs["load_date"].dt.month
    y = dfs["load_date"].dt.year
    monthrange(y,m)
    days = monthrange[1]
    return days
This however does not work when I attempt to apply it to the date column.
Additionally, I would like to be able to identify whether or not it is the current month, and if so return the number of days up to the current date in that month as opposed to days in the entire month.
I am not sure of the best way to do this, all I can think of is to check the month/year against datetime's today and then use a delta
thanks in advance
For pt.1 of your question, you can cast to pd.Period and retrieve days_in_month:
import pandas as pd
# create a sample df:
df = pd.DataFrame({'date': pd.date_range('2020-01', '2021-01', freq='M')})
df['daysinmonths'] = df['date'].apply(lambda t: pd.Period(t, freq='S').days_in_month)
# df['daysinmonths']
# 0 31
# 1 29
# 2 31
# ...
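Side note, a sketch rather than part of the answer above: the datetime accessor also exposes this directly, which skips the per-row Period construction:
df['daysinmonths'] = df['date'].dt.days_in_month  # same values, fully vectorized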
For pt.2, you can take the timestamp of 'now' and create a boolean mask for your date column, i.e. where its year/month is less than "now". Then calculate the cumsum of the daysinmonth column for the section where the mask returns True. Invert the order of that series to get the days until now.
now = pd.Timestamp('now')
m = (df['date'].dt.year <= now.year) & (df['date'].dt.month < now.month)
df['daysuntilnow'] = df['daysinmonths'][m].cumsum().iloc[::-1].reset_index(drop=True)
Update after comment: to get the elapsed days per month, you can do
df['dayselapsed'] = df['daysinmonths']
m = (df['date'].dt.year == now.year) & (df['date'].dt.month == now.month)
if m.any():
    df.loc[m, 'dayselapsed'] = now.day
df.loc[(df['date'].dt.year >= now.year) & (df['date'].dt.month > now.month), 'dayselapsed'] = 0
output
df
Out[13]:
date daysinmonths daysuntilnow dayselapsed
0 2020-01-31 31 213.0 31
1 2020-02-29 29 182.0 29
2 2020-03-31 31 152.0 31
3 2020-04-30 30 121.0 30
4 2020-05-31 31 91.0 31
5 2020-06-30 30 60.0 30
6 2020-07-31 31 31.0 31
7 2020-08-31 31 NaN 27
8 2020-09-30 30 NaN 0
9 2020-10-31 31 NaN 0
10 2020-11-30 30 NaN 0
11 2020-12-31 31 NaN 0

Getting the average of a certain hour on weekdays over several years in a pandas dataframe

I have an hourly dataframe in the following format over several years:
Date/Time Value
01.03.2010 00:00:00 60
01.03.2010 01:00:00 50
01.03.2010 02:00:00 52
01.03.2010 03:00:00 49
.
.
.
31.12.2013 23:00:00 77
I would like to average the data so I can get the average of hour 0, hour 1... hour 23 of each of the years.
So the output should look somehow like this:
Year Hour Avg
2010 00 63
2010 01 55
2010 02 50
.
.
.
2013 22 71
2013 23 80
Does anyone know how to obtain this in pandas?
Note: Now that Series have the dt accessor it's less important that date is the index, though Date/Time still needs to be a datetime64.
Update: You can do the groupby more directly (without the lambda):
In [21]: df.groupby([df["Date/Time"].dt.year, df["Date/Time"].dt.hour]).mean()
Out[21]:
Value
Date/Time Date/Time
2010 0 60
1 50
2 52
3 49
In [22]: res = df.groupby([df["Date/Time"].dt.year, df["Date/Time"].dt.hour]).mean()
In [23]: res.index.names = ["year", "hour"]
In [24]: res
Out[24]:
Value
year hour
2010 0 60
1 50
2 52
3 49
If it's a datetime64 index you can do:
In [31]: df1.groupby([df1.index.year, df1.index.hour]).mean()
Out[31]:
Value
2010 0 60
1 50
2 52
3 49
Old answer (will be slower):
Assuming Date/Time was the index* you can use a mapping function in the groupby:
In [11]: year_hour_means = df1.groupby(lambda x: (x.year, x.hour)).mean()
In [12]: year_hour_means
Out[12]:
Value
(2010, 0) 60
(2010, 1) 50
(2010, 2) 52
(2010, 3) 49
For a more useful index, you could then create a MultiIndex from the tuples:
In [13]: year_hour_means.index = pd.MultiIndex.from_tuples(year_hour_means.index,
                                                            names=['year', 'hour'])
In [14]: year_hour_means
Out[14]:
Value
year hour
2010 0 60
1 50
2 52
3 49
* if not, then first use set_index:
df1 = df.set_index('Date/Time')
If your date/time column were in the datetime format (see dateutil.parser for automatic parsing options), you can use pandas resample as below:
year_hour_means = df.resample('H',how = 'mean')
which will keep your data in the datetime format. This may help you with whatever you are going to be doing with your data down the line.
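One caveat, as a sketch rather than part of the answer above: the how= keyword was removed from resample in later pandas versions, so in current pandas the equivalent is written as a chained aggregation:
hourly_means = df.set_index('Date/Time').resample('H')['Value'].mean()  # pandas 2.2+ prefers the lowercase alias 'h'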
