Summarizing meteorological data with the help of a loop - python

I have a meteorological data set with daily precipitation values for 120 years. I would like to prepare it so that I end up with monthly average values for four climate periods. Example: average precipitation for January, February, March, ... for the period 1981 - 2010, average precipitation for January, February, March, ... for the period 2011 - 2040, and so on.
The data set looks like this (it is available as a CSV file and read in as a pandas DataFrame):
year month day lon lat value
0 1981 1 1 0 0 0.522592
1 1981 1 2 0 0 2.692495
2 1981 1 3 0 0 0.556698
3 1981 1 4 0 0 0.000000
4 1981 1 5 0 0 0.000000
... ... ... ... ... ... ...
43824 2100 12 27 0 0 0.000000
43825 2100 12 28 0 0 0.185120
43826 2100 12 29 0 0 10.252080
43827 2100 12 30 0 0 13.389290
43828 2100 12 31 0 0 3.523566
Here is my code so far:
csv_path = r'filepath.csv'
df = pd.read_csv(csv_path, delimiter = ';')
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
years = pd.date_range('1981-01-01', periods = 6, freq = '30YS').strftime('%Y')
labels = [f'{a}-{b}' for a, b in zip(years, years[1:])]
(df.assign(period = pd.cut(df['year'], bins = years.astype(int), labels = labels, right = False))
   .groupby(df[['year', 'month']].dt.to_period('M'))  # this line fails: .dt exists on Series, not on DataFrames
   .agg({'period': 'first', 'value': 'sum'})
   .groupby('period')['value'].mean())
The best way is probably to write a loop that iterates over all months and the four 30-year periods, but unfortunately I can't get this to work. Does anyone have any tips?
Expected Output:
Month Average
0 January 20
1 February 21
2 March 19
3 April 18

To get the total value per month and then the average over each 30-year period, you need a double groupby:
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])

# 30-year bin edges ('1981', '2011', ...) and 'start-end' labels for the periods
years = pd.date_range('1981-01-01', periods=6, freq='30YS').strftime('%Y')
labels = [f'{a}-{b}' for a, b in zip(years, years[1:])]

(df
 .assign(period=pd.cut(df['year'], bins=years.astype(int), labels=labels, right=False))
 .groupby(df['date'].dt.to_period('M')).agg({'period': 'first', 'value': 'sum'})  # total per month
 .groupby('period')['value'].mean()  # average of the monthly totals per 30-year period
)
output:
period
1981-2011 3.771785
2011-2041 NaN
2041-2071 NaN
2071-2101 27.350056
2101-2131 NaN
Name: value, dtype: float64
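Note that the question asks for one average per calendar month within each period; a minimal sketch extending the above to that shape (assuming df, years and labels from the snippet above):
monthly = (df
 .assign(period=pd.cut(df['year'], bins=years.astype(int), labels=labels, right=False))
 .groupby(df['date'].dt.to_period('M'))
 .agg({'period': 'first', 'value': 'sum'})
)
monthly['month'] = monthly.index.month
# rows: 30-year periods, columns: calendar months 1-12, values: mean monthly precipitation
print(monthly.groupby(['period', 'month'])['value'].mean().unstack())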
Older answer
The expected output is not fully clear, but if you want average precipitation per quarter per year:
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
df['quarter'] = df['date'].dt.to_period('Q')
df.groupby('quarter')['value'].mean()
output:
quarter
1981Q1 0.754357
2100Q4 5.470011
Freq: Q-DEC, Name: value, dtype: float64
or per quarter globally:
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
df['quarter'] = df['date'].dt.quarter
df.groupby('quarter')['value'].mean()
output:
quarter
1 0.754357
4 5.470011
Name: value, dtype: float64
NB. you can do the same for other periods. For months use to_period('M') / .dt.month
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
df['period'] = df['date'].dt.to_period('M')
df.groupby('period')['value'].mean()
output:
period
1981-01 0.754357
2100-12 5.470011
Freq: M, Name: value, dtype: float64
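For a global per-calendar-month average (a sketch mirroring the quarter example, using .dt.month):
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
df.groupby(df['date'].dt.month)['value'].mean()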

Related

How to find the number of days in each month between two dates in different years

I'm trying to get the number of days between two dates, but broken down per month.
I found some answers, but I can't figure out how to do it when the dates fall in two different years.
For example, I have this dataframe:
import pandas as pd

df = {'Id': ['1', '2', '3', '4', '5'],
      'Item': ['A', 'B', 'C', 'D', 'E'],
      'StartDate': ['2019-12-10', '2019-12-01', '2019-01-01', '2019-05-10', '2019-03-10'],
      'EndDate': ['2020-01-30', '2020-02-02', '2020-03-03', '2020-03-03', '2020-02-02']
      }
df = pd.DataFrame(df, columns=['Id', 'Item', 'StartDate', 'EndDate'])
And I want to get a dataframe of day counts per month (see the output at the end):
s = (df[["StartDate", "EndDate"]]
     .apply(lambda row: pd.date_range(row.StartDate, row.EndDate), axis=1)
     .explode())
new = (s.groupby([s.index, s.dt.year, s.dt.month])
        .count()
        .unstack(level=[1, 2], fill_value=0))
new.columns = new.columns.map(lambda c: f"{c[0]}-{str(c[1]).zfill(2)}")
new = new.sort_index(axis="columns")
- get all the dates between StartDate and EndDate per row, and explode that list of dates into rows of their own
- group by the row id, year and month & count records
- unstack the year & month identifiers to the columns side as a MultiIndex
- join the year & month values with a hyphen in between (also zero-fill months, e.g., 03)
- lastly, sort the year-month pairs on the columns
to get
>>> new
2019-11 2019-12 2020-01 2020-02 2020-03
0 0 22 30 0 0
1 0 31 31 2 0
2 0 31 31 29 3
3 21 31 31 29 3
4 9 31 31 2 0
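As a quick sanity check (a sketch, assuming df and new from above), each row's month counts should sum to the inclusive length of its date span:
spans = (pd.to_datetime(df['EndDate']) - pd.to_datetime(df['StartDate'])).dt.days + 1
assert (new.sum(axis=1) == spans).all()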

Convert multiple time formats from object to datetime format

I have a dataframe with a column of time values stored as object and need to convert them to datetime. The issue is that they are not all in the same format, so when I try:
df['Total call time'] = pd.to_datetime(df['Total call time'], format='%H:%M:%S')
it gives me an error
ValueError: time data '3:22' does not match format '%H:%M:%S' (match)
or if use this code
df['Total call time'] = pd.to_datetime(df['Total call time'], format='%H:%M')
I get this error
ValueError: unconverted data remains: :58
These are the values on my data
Total call time
2:04:07
3:22:41
2:30:41
2:19:06
1:45:55
1:30:08
1:32:15
1:43:28
**45:48**
1:41:40
5:08:37
**3:22**
4:29:05
2:47:25
2:39:29
2:29:32
2:09:52
3:31:57
2:27:58
2:34:28
3:14:10
2:12:10
2:46:58
times = """\
2:04:07
3:22:41
2:30:41
2:19:06
1:45:55
1:30:08
1:32:15
1:43:28
45:48
1:41:40
5:08:37
3:22
4:29:05
2:47:25
2:39:29
2:29:32
2:09:52
3:31:57
2:27:58
2:34:28
3:14:10
2:12:10
2:46:58""".split()
import pandas as pd
df = pd.DataFrame(times, columns=['elapsed'])
def pad(s):
    if len(s) == 4:
        return '00:0' + s  # e.g. '3:22'  -> '00:03:22'
    elif len(s) == 5:
        return '00:' + s   # e.g. '45:48' -> '00:45:48'
    return s

print(pd.to_timedelta(df['elapsed'].apply(pad)))
Output:
0 0 days 02:04:07
1 0 days 03:22:41
2 0 days 02:30:41
3 0 days 02:19:06
4 0 days 01:45:55
5 0 days 01:30:08
6 0 days 01:32:15
7 0 days 01:43:28
8 0 days 00:45:48
9 0 days 01:41:40
10 0 days 05:08:37
11 0 days 00:03:22
12 0 days 04:29:05
13 0 days 02:47:25
14 0 days 02:39:29
15 0 days 02:29:32
16 0 days 02:09:52
17 0 days 03:31:57
18 0 days 02:27:58
19 0 days 02:34:28
20 0 days 03:14:10
21 0 days 02:12:10
22 0 days 02:46:58
Name: elapsed, dtype: timedelta64[ns]
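Once the values are timedeltas, arithmetic works on them directly; a small sketch (assuming the padded Series from above):
td = pd.to_timedelta(df['elapsed'].apply(pad))
print(td.sum())   # total call time
print(td.mean())  # average call time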
As an alternative to grovina's answer, instead of using apply you can use the dt accessor directly.
Here's a sample:
>>> data = [['2017-12-01'], ['2017-12-30'], ['2018-01-01']]
>>> df = pd.DataFrame(data=data, columns=['date'])
>>> df
date
0 2017-12-01
1 2017-12-30
2 2018-01-01
>>> df.date
0 2017-12-01
1 2017-12-30
2 2018-01-01
Name: date, dtype: object
Note how df.date has dtype object? Let's turn it into a datetime like you want:
>>> df.date = pd.to_datetime(df.date)
>>> df.date
0 2017-12-01
1 2017-12-30
2 2018-01-01
Name: date, dtype: datetime64[ns]
The format you want is a string format; you won't be able to make the actual datetime64 values display that way. For now, let's put a formatted string version of the date in a separate column:
>>> df['new_formatted_date'] = df.date.dt.strftime('%d/%m/%y %H:%M')
>>> df.new_formatted_date
0 01/12/17 00:00
1 30/12/17 00:00
2 01/01/18 00:00
Name: new_formatted_date, dtype: object
Finally, since the df.date column is now of dtype datetime64, you can use the dt accessor directly on it. No need for apply:
>>> df['month'] = df.date.dt.month
>>> df['day'] = df.date.dt.day
>>> df['year'] = df.date.dt.year
>>> df['hour'] = df.date.dt.hour
>>> df['minute'] = df.date.dt.minute
>>> df
        date new_formatted_date  month  day  year  hour  minute
0 2017-12-01     01/12/17 00:00     12    1  2017     0       0
1 2017-12-30     30/12/17 00:00     12   30  2017     0       0
2 2018-01-01     01/01/18 00:00      1    1  2018     0       0
Another idea is to test whether the value contains two colons and, if not, decide how to pad it by testing the number before the first colon: if it is greater than 23, the value is parsed as MM:SS (prepend '00:'), otherwise as HH:MM (append ':00'). Then convert to timedeltas with to_timedelta:
import numpy as np

m1 = df['Total call time'].str.count(':').ne(2)
m2 = df['Total call time'].str.extract(r'^(\d+):', expand=False).astype(float).gt(23)

s = np.select([m1 & m2, m1 & ~m2],
              ['00:' + df['Total call time'], df['Total call time'] + ':00'],
              df['Total call time'])

df['Total call time'] = pd.to_timedelta(s)
print(df)
Total call time
0 0 days 02:04:07
1 0 days 03:22:41
2 0 days 02:30:41
3 0 days 02:19:06
4 0 days 01:45:55
5 0 days 01:30:08
6 0 days 01:32:15
7 0 days 01:43:28
8 0 days 00:45:48
9 0 days 01:41:40
10 0 days 05:08:37
11 0 days 03:22:00
12 0 days 04:29:05
13 0 days 02:47:25
14 0 days 02:39:29
15 0 days 02:29:32
16 0 days 02:09:52
17 0 days 03:31:57
18 0 days 02:27:58
19 0 days 02:34:28
20 0 days 03:14:10
21 0 days 02:12:10
22 0 days 02:46:58
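Note that the two answers disagree on the ambiguous value 3:22: the pad approach above yields 00:03:22 (minutes:seconds) while this rule yields 03:22:00 (hours:minutes); which is right depends on your data. A sketch to flag such single-colon rows for review, run on the raw strings before the conversion above:
mask = df['Total call time'].str.count(':').ne(2)
print(df.loc[mask, 'Total call time'])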

Python: Pandas dataframe get the year to which the week number belongs and not the year of the date

I have a csv-file: https://data.rivm.nl/covid-19/COVID-19_aantallen_gemeente_per_dag.csv
I want to use it to provide insight into the corona deaths per week.
df = pd.read_csv("covid.csv", error_bad_lines=False, sep=";")
df = df.loc[df['Deceased'] > 0]
df["Date_of_publication"] = pd.to_datetime(df["Date_of_publication"])
df["Week"] = df["Date_of_publication"].dt.isocalendar().week
df["Year"] = df["Date_of_publication"].dt.year
df = df[["Week", "Year", "Municipality_name", "Deceased"]]
df = df.groupby(by=["Week", "Year", "Municipality_name"]).agg({"Deceased" : "sum"})
df = df.sort_values(by=["Year", "Week"])
print(df)
Everything seems to be working fine except for the first 3 days of 2021. The first 3 days of 2021 are part of the last week (53) of 2020: http://week-number.net/calendar-with-week-numbers-2021.html.
When I print the dataframe this is the result:
53 2021 Winterswijk 1
Woudenberg 1
Zaanstad 1
Zeist 2
Zutphen 1
So basically what I'm looking for is a way where this line returns the year of the week number and not the year of the date:
df["Year"] = df["Date_of_publication"].dt.year
You can use dt.isocalendar().year to set up df["Year"]:
df["Year"] = df["Date_of_publication"].dt.isocalendar().year
This gives year 2020 for the date 2021-01-01, but goes back to year 2021 for 2021-01-04.
This is just like how you used dt.isocalendar().week to set up df["Week"]. Since both are based on the same (year, week, day) tuple returned by dt.isocalendar(), they will always be in sync.
Demo
date_s = pd.Series(pd.date_range(start='2021-01-01', periods=5, freq='1D'))
date_s
0 2021-01-01
1 2021-01-02
2 2021-01-03
3 2021-01-04
4 2021-01-05
date_s.dt.isocalendar()
year week day
0 2020 53 5
1 2020 53 6
2 2020 53 7
3 2021 1 1
4 2021 1 2
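Applied to the original pipeline (a sketch, assuming df from the question), both columns then come from the same isocalendar frame:
iso = df["Date_of_publication"].dt.isocalendar()
df["Week"] = iso.week
df["Year"] = iso.year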
You can simply subtract the two dates and then integer-divide the days attribute of the timedelta object by 7.
For example, this gives the time elapsed in the year so far:
import datetime as dt

time_delta = dt.datetime.today() - dt.datetime(2021, 1, 1)
The output is a datetime timedelta object
datetime.timedelta(days=75, seconds=84904, microseconds=144959)
For your problem, you'd do something like this:
weeks = (df["Date_of_publication"] - pd.to_datetime(df["Year"].astype(str), format="%Y")).dt.days // 7
The output is the number of whole weeks elapsed between the start of the year and Date_of_publication.
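A quick check of the arithmetic (a sketch; note this simple count differs from the ISO week number the question asks about):
dates = pd.to_datetime(['2021-01-01', '2021-01-08', '2021-03-17'])
print(((dates - pd.Timestamp('2021-01-01')).days // 7).tolist())  # [0, 1, 10]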

Python: how to groupby a pandas dataframe to count by hour and day?

I have a dataframe like the following:
df.head(4)
timestamp user_id category
0 2017-09-23 15:00:00+00:00 A Bar
1 2017-09-14 18:00:00+00:00 B Restaurant
2 2017-09-30 00:00:00+00:00 B Museum
3 2017-09-11 17:00:00+00:00 C Museum
I would like to count, for each hour, the number of visitors in each category, and get a dataframe like the following:
df
year month day hour category count
0 2017 9 11 0 Bar 2
1 2017 9 11 1 Bar 1
2 2017 9 11 2 Bar 0
3 2017 9 11 3 Bar 1
Assuming you want to group by date and hour, you can use the following code if the timestamp column is a datetime column (note the bracket assignment: df.year = ... would create an attribute, not a column):
df['year'] = df.timestamp.dt.year
df['month'] = df.timestamp.dt.month
df['day'] = df.timestamp.dt.day
df['hour'] = df.timestamp.dt.hour

grouped_data = df.groupby(['year', 'month', 'day', 'hour', 'category']).count()
To get the count of user_id per hour per category, you can use groupby with your datetime:
df.timestamp = pd.to_datetime(df['timestamp'])
df_new = df.groupby([df.timestamp.dt.year,
                     df.timestamp.dt.month,
                     df.timestamp.dt.day,
                     df.timestamp.dt.hour,
                     'category']).count()['user_id']
df_new.index.names = ['year', 'month', 'day', 'hour', 'category']
df_new = df_new.reset_index()
When you have a datetime column in a dataframe, you can use the dt accessor, which gives you access to the different parts of the datetime, e.g. year.
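Note that groupby only yields rows for hours that actually occur in the data; a sketch to materialize explicit zeros for all 24 hours, as in the expected output (assuming df_new from above):
wide = (df_new.set_index(['year', 'month', 'day', 'category', 'hour'])['user_id']
              .unstack('hour', fill_value=0)
              .reindex(columns=range(24), fill_value=0))
tidy = wide.stack().rename('count').reset_index()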

Python Pandas groupby multiple counts

I have a dataframe that looks like:
id email domain created_at company
0 1 son@mail.com old.com 2017-01-21 18:19:00 company_a
1 2 boy@mail.com new.com 2017-01-22 01:19:00 company_b
2 3 girl@mail.com nadda.com 2017-01-22 01:19:00 no_company
I need to summarize the data by year and month, and by whether the company column has a value other than "no_company":
Desired output:
year month company count
2017 1 has_company 2
no_company 1
The following works, but gives me the count for each distinct value in the company column:
new_df = test_df['created_at'].groupby([test_df.created_at.dt.year, test_df.created_at.dt.month, test_df.company]).agg('count')
print(new_df)
result:
year month company
2017 1 company_a 1
company_b 1
no_company 1
Map a new series for has_company/no_company then groupby:
c = df.company.map(lambda x: x if x == 'no_company' else 'has_company')
y = df.created_at.dt.year.rename('year')
m = df.created_at.dt.month.rename('month')
df.groupby([y, m, c]).size()
year month company
2017 1 has_company 2
no_company 1
dtype: int64
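An equivalent sketch using Series.where instead of map (keep 'no_company', replace everything else; assumes y and m from above):
c = df.company.where(df.company.eq('no_company'), 'has_company')
df.groupby([y, m, c]).size()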
