I have a pandas dataframe containing long-term data:
point_id issue_date latitude longitude rainfall
0 1.0 2020-01-01 6.5 66.50 NaN
1 2.0 2020-01-02 6.5 66.75 NaN
... ... ... ... ... ...
6373888 17414.0 2020-12-30 38.5 99.75 NaN
6373889 17415.0 2020-12-31 38.5 100.00 NaN
6373890 rows × 5 columns
I want to extract the Standard Meteorological Week from its issue_date column, as defined below.
I have tried it in two ways.
1st:
lulc_gdf['smw'] = lulc_gdf['issue_date'].astype('datetime64[ns]').dt.strftime('%V')
2nd:
lulc_gdf['iso'] = lulc_gdf['issue_date'].astype('datetime64[ns]').dt.isocalendar().week
The output in both cases is the same:
point_id issue_date latitude longitude rainfall smw iso
0 1.0 2020-01-01 6.5 66.50 NaN 01 1
1 2.0 2020-01-02 6.5 66.75 NaN 01 1
... ... ... ... ... ... ... ...
6373888 17414.0 2020-12-30 38.5 99.75 NaN 53 53
6373889 17415.0 2020-12-31 38.5 100.00 NaN 53 53
6373890 rows × 7 columns
The issue is that both methods take Monday (or Sunday) as the first day of the week, irrespective of the year.
For example, in 2020 the 1st of January falls on a Wednesday (not Monday),
so the first ISO week has only 5 days, i.e. (Wed, Thu, Fri, Sat & Sun).
year week day issue_date
0 2020 1 3 2020-01-01
1 2020 1 4 2020-01-02
2 2020 1 5 2020-01-03
3 2020 1 6 2020-01-04
... ... ... ...
6373889 2020 53 4 2020-12-31
But in the case of Standard Meteorological Weeks,
I want the output, for every year, to be:
1st week - always from 1st January to 7th January
2nd week - from 8th January to 14th January
3rd week - from 15th January to 21st January
------------------------------- and so on
irrespective of the starting day of the year (Sunday, Monday, etc.).
How can I do this?
Use:
import numpy as np
import pandas as pd

df = pd.DataFrame({'issue_date': pd.date_range('2000-01-01', '2000-12-31')})

# inspired by https://stackoverflow.com/a/61592907/2901002
# lookup table: 0-based day of year -> SMW number; the last week absorbs the
# extra days, so week 52 has 8 days
normal_year = np.append(np.arange(363) // 7 + 1, np.repeat(52, 5))
# leap years get an extra day inserted into week 9 (29 Feb), giving it 8 days
leap_year = np.concatenate((normal_year[:59], [9], normal_year[59:366]))
days = df['issue_date'].dt.dayofyear

df['smw'] = np.where(df['issue_date'].dt.is_leap_year,
                     leap_year[days - 1],
                     normal_year[days - 1])

print(df[df['smw'] == 9])
issue_date smw
56 2000-02-26 9
57 2000-02-27 9
58 2000-02-28 9
59 2000-02-29 9
60 2000-03-01 9
61 2000-03-02 9
62 2000-03-03 9
63 2000-03-04 9
Performance (comparing with get_smw from the answer below):
#11323 rows
df = pd.DataFrame({'issue_date': pd.date_range('2000-01-01','2030-12-31')})
In [6]: %%timeit
...: normal_year = np.append(np.arange(363) // 7 + 1, np.repeat(52, 5))
...: leap_year = np.concatenate((normal_year[:59], [9], normal_year[59:366]))
...: days = df['issue_date'].dt.dayofyear
...:
...: df['smw'] = np.where(df['issue_date'].dt.is_leap_year, leap_year[days - 1], normal_year[days - 1])
...:
3.51 ms ± 154 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [7]: %%timeit
...: df['smw1'] = get_smw(df['issue_date'])
...:
17.2 ms ± 312 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#51500 rows
df = pd.DataFrame({'issue_date': pd.date_range('1900-01-01','2040-12-31')})
In [9]: %%timeit
...: normal_year = np.append(np.arange(363) // 7 + 1, np.repeat(52, 5))
...: leap_year = np.concatenate((normal_year[:59], [9], normal_year[59:366]))
...: days = df['issue_date'].dt.dayofyear
...:
...: df['smw'] = np.where(df['issue_date'].dt.is_leap_year, leap_year[days - 1], normal_year[days - 1])
...:
...:
11.9 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [10]: %%timeit
...: df['smw1'] = get_smw(df['issue_date'])
...:
...:
64.3 ms ± 483 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
You can write a custom function to calculate the Standard Meteorological Weeks.
The basic calculation takes the difference in days from 1st January of the same year, divides by 7 and adds 1.
Two special adjustments are needed: Week No. 9 gets 8 days in a leap year (to absorb 29 Feb), and the last week of every year gets 8 days:
import numpy as np
import pandas as pd

# convert to datetime format if not already in datetime
df['issue_date'] = pd.to_datetime(df['issue_date'])

def get_smw(date_s):
    # day-of-the-year minus 1, in range [0..364/365], ready for division by 7
    days_diff = date_s.dt.dayofyear - 1
    # leap-year adjustment so Week No. 9 has 8 days:
    # subtract one day from 29 Feb onwards in the same year
    leap_adj = date_s.dt.is_leap_year & (date_s > pd.to_datetime(date_s.dt.year.astype(str) + '-02-28'))
    days_diff = np.where(leap_adj, days_diff - 1, days_diff)
    # last-week adjustment so the final week has 8 days:
    # cap at 363 so 31 Dec stays in the same week as 24 Dec
    days_diff = np.clip(days_diff, 0, 363)
    return days_diff // 7 + 1

df['smw'] = get_smw(df['issue_date'])
Result:
print(df)
point_id issue_date latitude longitude rainfall smw
0 1.0 2020-01-01 6.5 66.50 NaN 1
1 2.0 2020-01-02 6.5 66.75 NaN 1
2 3.0 2020-01-03 6.5 66.75 NaN 1
3 4.0 2020-01-04 6.5 66.75 NaN 1
4 5.0 2020-01-05 6.5 66.75 NaN 1
5 6.0 2020-01-06 6.5 66.75 NaN 1
6 7.0 2020-01-07 6.5 66.75 NaN 1
7 8.0 2020-01-08 6.5 66.75 NaN 2
8 9.0 2020-01-09 6.5 66.75 NaN 2
40 40.0 2020-02-26 6.5 66.75 NaN 9
41 41.0 2020-03-04 6.5 66.75 NaN 9
42 42.0 2020-03-05 6.5 66.75 NaN 10
43 43.0 2020-03-12 6.5 66.75 NaN 11
6373880 17414.0 2020-12-23 38.5 99.75 NaN 51
6373881 17414.0 2020-12-24 38.5 99.75 NaN 52
6373888 17414.0 2020-12-30 38.5 99.75 NaN 52
6373889 17415.0 2020-12-31 38.5 100.00 NaN 52
7000040 40.0 2021-02-26 6.5 66.75 NaN 9
7000041 41.0 2021-03-04 6.5 66.75 NaN 9
7000042 42.0 2021-03-05 6.5 66.75 NaN 10
7000042 43.0 2021-03-12 6.5 66.75 NaN 11
7373880 17414.0 2021-12-23 38.5 99.75 NaN 51
7373881 17414.0 2021-12-24 38.5 99.75 NaN 52
7373888 17414.0 2021-12-30 38.5 99.75 NaN 52
7373889 17415.0 2021-12-31 38.5 100.00 NaN 52
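As a quick sanity check, here is a minimal sketch over a single synthetic year (assuming get_smw from above is defined) verifying the week boundaries:
import pandas as pd

check = pd.DataFrame({'issue_date': pd.date_range('2020-01-01', '2020-12-31')})
check['smw'] = get_smw(check['issue_date'])

# week 1 always spans 1-7 January
print(check.loc[check['smw'] == 1, 'issue_date'].dt.strftime('%m-%d').tolist())
# ['01-01', '01-02', '01-03', '01-04', '01-05', '01-06', '01-07']

# in a leap year like 2020, week 9 and week 52 each get 8 days
print(check['smw'].value_counts().loc[[9, 52]].tolist())  # [8, 8]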
I have a pandas dataframe with multiple columns and I would like to create a new dataframe by flattening all columns into one using the melt function. But I do not want the column names from the original dataframe to be a part of the new dataframe.
Below is the sample dataframe and code. Is there a way to make it more concise?
date Col1 Col2 Col3 Col4
1990-01-02 12:00:00 24 24 24.8 24.8
1990-01-02 01:00:00 59 58 60 60.3
1990-01-02 02:00:00 43.7 43.9 48 49
The output desired:
Rates
0 24
1 59
2 43.7
3 24
4 58
5 43.9
6 24.8
7 60
8 48
9 24.8
10 60.3
11 49
Code:
df = df.melt(var_name='ColumnNames', value_name='Rates')  # using melt to flatten columns
df.drop(['ColumnNames'], axis=1, inplace=True)  # dropping 'ColumnNames'
Set the value_name and value_vars params for your purpose:
In [137]: pd.melt(df, value_name='Rates', value_vars=df.columns[1:]).drop('variable', axis=1)
Out[137]:
    Rates
0 24.0
1 59.0
2 43.7
3 24.0
4 58.0
5 43.9
6 24.8
7 60.0
8 48.0
9 24.8
10 60.3
11 49.0
As an alternative you can use stack() and transpose():
dfx = df.T.stack().reset_index(drop=True)  # date must be the index
Output:
0
0 24.0
1 59.0
2 43.7
3 24.0
4 58.0
5 43.9
6 24.8
7 60.0
8 48.0
9 24.8
10 60.3
11 49.0
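If only the flat column of values is needed, a NumPy-based sketch also works (assuming, as in the sample, that the date column is the first column; order='F' walks the array column by column so all of Col1 comes first, matching the desired order):
import pandas as pd

# drop the date column, then flatten column-wise into a single 'Rates' column
flat = pd.DataFrame({'Rates': df.iloc[:, 1:].to_numpy().ravel(order='F')})
print(flat)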
I'm trying to run an ANOVA test on a dataframe that looks like this:
   code  2020-11-01  2020-11-02  2020-11-03  2020-11-04 ...
0 1 22.5 73.1 12.2 77.5
1 1 23.1 75.4 12.4 78.3
2 2 43.1 72.1 13.4 85.4
3 2 41.6 85.1 34.1 96.5
4 3 97.3 43.2 31.1 55.3
5 3 12.1 44.4 32.2 52.1
...
I want to calculate a one-way ANOVA for each column based on the code. For that I have used statsmodels and a for loop:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

keys = []
tables = []

for variable in df.columns[1:]:
    model = ols('{} ~ code'.format(variable), data=df).fit()
    anova_table = sm.stats.anova_lm(model)
    keys.append(variable)
    tables.append(anova_table)

df_anova = pd.concat(tables, keys=keys, axis=0)
df_anova
The problem is that I keep getting an error on the ols(...) line:
PatsyError: numbers besides '0' and '1' are only allowed with **
2020-11-01 ~ code
^^^^
I have tried to use the Q argument as suggested here:
...
model = ols('{Q(x)} ~ code'.format(x=variable), data=df).fit()
KeyError: 'Q(x)'
I have also tried placing the Q outside the braces but got the same error.
My end goal: to calculate a one-way ANOVA for each day (each column) based on the "code" column.
You can reshape the dataframe to long format and skip the iteration through the columns:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
df = pd.DataFrame({"code":[1,1,2,2,3,3],
"2020-11-01":[22.5,23.1,43.1,41.6,97.3,12.1],
"2020-11-02":[73.1,75.4,72.1,85.1,43.2,44.4]})
df_long = df.melt(id_vars="code")
df_long
code variable value
0 1 2020-11-01 22.5
1 1 2020-11-01 23.1
2 2 2020-11-01 43.1
3 2 2020-11-01 41.6
4 3 2020-11-01 97.3
5 3 2020-11-01 12.1
6 1 2020-11-02 73.1
7 1 2020-11-02 75.4
8 2 2020-11-02 72.1
9 2 2020-11-02 85.1
10 3 2020-11-02 43.2
11 3 2020-11-02 44.4
Then applying your code:
tables = []
keys = df_long.variable.unique()

for D in keys:
    model = ols('value ~ code', data=df_long[df_long.variable == D]).fit()
    anova_table = sm.stats.anova_lm(model)
    tables.append(anova_table)

pd.concat(tables, keys=keys)
Or simply:
def aov_func(x):
    model = ols('value ~ code', data=x).fit()
    return sm.stats.anova_lm(model)

df_long.groupby("variable").apply(aov_func)
Gives this result:
df sum_sq mean_sq F PR(>F)
variable
2020-11-01 code 1.0 1017.6100 1017.610000 1.115768 0.350405
Residual 4.0 3648.1050 912.026250 NaN NaN
2020-11-02 code 1.0 927.2025 927.202500 6.194022 0.067573
Residual 4.0 598.7725 149.693125 NaN NaN
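If only the F statistic and p-value are needed (not the full statsmodels table), a sketch using scipy.stats.f_oneway on the same long-format dataframe would look like this (f_oneway takes one array of values per group):
import pandas as pd
from scipy import stats

def f_oneway_func(x):
    # one array of values per code group, unpacked into f_oneway(g1, g2, g3)
    groups = [g['value'].to_numpy() for _, g in x.groupby('code')]
    f_stat, p_value = stats.f_oneway(*groups)
    return pd.Series({'F': f_stat, 'PR(>F)': p_value})

df_long.groupby('variable').apply(f_oneway_func)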
How do I add a day_of_week column (the day of the week, e.g. 1 = Mon, 2 = Tue) to df1 according to the year, month and day values, as shown below:
year month day A B C D day_of_week
0 2019 1 1 26.2 20.2 0.0 32.4 2
1 2019 1 2 22.9 20.3 0.0 10.0 3
2 2019 1 3 24.8 18.4 0.0 28.8 4
3 2019 1 4 26.6 18.3 0.0 33.5 5
4 2019 1 5 28.3 20.9 0.0 33.4 6
Use pd.to_datetime on the year/month/day columns, then the .dt.dayofweek attribute plus 1:
df['day_of_week'] = pd.to_datetime(df[['year', 'month', 'day']],
errors='coerce').dt.dayofweek + 1
Out[410]:
year month day A B C D day_of_week
0 2019 1 1 26.2 20.2 0.0 32.4 2
1 2019 1 2 22.9 20.3 0.0 10.0 3
2 2019 1 3 24.8 18.4 0.0 28.8 4
3 2019 1 4 26.6 18.3 0.0 33.5 5
4 2019 1 5 28.3 20.9 0.0 33.4 6
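A related pandas-only sketch: the weekday name is also available directly from the same datetime conversion (the day_name column here is my naming, not from the question):
import pandas as pd

# day_name() returns the weekday name, e.g. 'Tuesday', for the same dates
df['day_name'] = pd.to_datetime(df[['year', 'month', 'day']]).dt.day_name()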
If you also want the day name, you can make use of calendar.
import datetime
import calendar

def findDay(date):
    day_number = datetime.datetime.strptime(date, '%d %m %Y').weekday() + 1
    day_name = calendar.day_name[day_number - 1]
    return (day_number, day_name)
You can generate a date string from the year, month, day columns using something like date = f'{day} {month} {year}'.
Then all you have to do is apply the function above on the new date column. The function will return a tuple with the day number of the week as well as the day name, as in the sketch below.
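A minimal sketch of those two steps (the date and day_name column names are mine, for illustration):
# build the '%d %m %Y' string that findDay expects, then apply it row by row
df['date'] = df.apply(lambda r: f"{r['day']} {r['month']} {r['year']}", axis=1)
df[['day_of_week', 'day_name']] = df['date'].apply(findDay).tolist()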
I have two data frames. One has rows for every five minutes in a day:
df
TIMESTAMP TEMP
1 2011-06-01 00:05:00 24.5
200 2011-06-01 16:40:00 32.0
1000 2011-06-04 11:20:00 30.2
5000 2011-06-18 08:40:00 28.4
10000 2011-07-05 17:20:00 39.4
15000 2011-07-23 02:00:00 29.3
20000 2011-08-09 10:40:00 29.5
30656 2011-09-15 10:40:00 13.8
I have another dataframe that ranks the days:
ranked
TEMP DATE RANK
62 43.3 2011-08-02 1.0
63 43.1 2011-08-03 2.0
65 43.1 2011-08-05 3.0
38 43.0 2011-07-09 4.0
66 42.8 2011-08-06 5.0
64 42.5 2011-08-04 6.0
84 42.2 2011-08-24 7.0
56 42.1 2011-07-27 8.0
61 42.1 2011-08-01 9.0
68 42.0 2011-08-08 10.0
Both the columns TIMESTAMP and DATE are datetime datatypes (dtype returns dtype('M8[ns]')).
What I want to do is add a RANK column to df, filling it with each day's rank from ranked based on the TIMESTAMP (so all the 5-minute timesteps within a day get the same rank).
So, the final result would look something like this:
df
TIMESTAMP TEMP RANK
1 2011-06-01 00:05:00 24.5 98.0
200 2011-06-01 16:40:00 32.0 98.0
1000 2011-06-04 11:20:00 30.2 96.0
5000 2011-06-18 08:40:00 28.4 50.0
10000 2011-07-05 17:20:00 39.4 9.0
15000 2011-07-23 02:00:00 29.3 45.0
20000 2011-08-09 10:40:00 29.5 40.0
30656 2011-09-15 10:40:00 13.8 100.0
What I have done so far:
# Separate the date and times.
df['DATE'] = df['TIMESTAMP'].dt.normalize()
df['TIME'] = df['TIMESTAMP'].dt.time
df = df[['DATE', 'TIME', 'TEMP']]

df['RANK'] = 0
for index, row in df.iterrows():
    df.loc[index, 'RANK'] = ranked[ranked['DATE'] == row['DATE']]['RANK'].values
But I think I am going in a very wrong direction because this takes ages to complete.
How do I improve this code?
IIUC, you can play with the indexes to match the values:
df = df.set_index(df.TIMESTAMP.dt.date)\
.assign(RANK=ranked.set_index('DATE').RANK)\
.set_index(df.index)
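An alternative sketch that maps each timestamp's day to its rank (assuming the DATE values in ranked are midnight timestamps, as in the question):
# normalize each timestamp to midnight, then look up that day's rank;
# days missing from `ranked` come back as NaN
df['RANK'] = df['TIMESTAMP'].dt.normalize().map(ranked.set_index('DATE')['RANK'])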
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    zip(["A", "B", "C"],
        [np.array(["id %02d" % i for i in range(1, 11)]).repeat(10),
         pd.date_range("2018-01-01", periods=100).strftime("%Y-%m-%d"),
         list(range(10, 110))])
))
df = df.groupby(["A", "B"]).sum()
df["D"] = df["C"].shift(1).rolling(2).mean()
df
This code generates a dataframe indexed by (A, B) where D is the rolling mean of the previous two values of C, computed over the whole column.
I want the rolling logic to start over for every new ID. Right now, ID 02 is using the last two values from ID 01 to calculate the mean.
How can this be achieved?
I believe you need groupby:
df['D'] = df["C"].shift(1).groupby(df['A'], group_keys=False).rolling(2).mean()
print (df.head(20))
C D
A B
id 01 2018-01-01 10 NaN
2018-01-02 11 NaN
2018-01-03 12 10.5
2018-01-04 13 11.5
2018-01-05 14 12.5
2018-01-06 15 13.5
2018-01-07 16 14.5
2018-01-08 17 15.5
2018-01-09 18 16.5
2018-01-10 19 17.5
id 02 2018-01-11 20 NaN
2018-01-12 21 19.5
2018-01-13 22 20.5
2018-01-14 23 21.5
2018-01-15 24 22.5
2018-01-16 25 23.5
2018-01-17 26 24.5
2018-01-18 27 25.5
2018-01-19 28 26.5
2018-01-20 29 27.5
Or:
df['D'] = df["C"].groupby(df['A']).shift(1).rolling(2).mean()
print (df.head(20))
C D
A B
id 01 2018-01-01 10 NaN
2018-01-02 11 NaN
2018-01-03 12 10.5
2018-01-04 13 11.5
2018-01-05 14 12.5
2018-01-06 15 13.5
2018-01-07 16 14.5
2018-01-08 17 15.5
2018-01-09 18 16.5
2018-01-10 19 17.5
id 02 2018-01-11 20 NaN
2018-01-12 21 NaN
2018-01-13 22 20.5
2018-01-14 23 21.5
2018-01-15 24 22.5
2018-01-16 25 23.5
2018-01-17 26 24.5
2018-01-18 27 25.5
2018-01-19 28 26.5
2018-01-20 29 27.5
While the accepted answer by @jezrael works correctly for positive shifts, it gives partially incorrect results for negative shifts. Please check the following:
df['D'] = df["C"].groupby(df['A']).shift(1).rolling(2).mean()
df['E'] = df["C"].groupby(df['A']).rolling(2).mean().shift(1).values
df['F'] = df["C"].groupby(df['A']).shift(-1).rolling(2).mean()
df['G'] = df["C"].groupby(df['A']).rolling(2).mean().shift(-1).values
df.set_index(['A', 'B'], inplace=True)
print(df.head(20))
C D E F G
A B
id 01 2018-01-01 10 NaN NaN NaN 10.5
2018-01-02 11 NaN NaN 11.5 11.5
2018-01-03 12 10.5 10.5 12.5 12.5
2018-01-04 13 11.5 11.5 13.5 13.5
2018-01-05 14 12.5 12.5 14.5 14.5
2018-01-06 15 13.5 13.5 15.5 15.5
2018-01-07 16 14.5 14.5 16.5 16.5
2018-01-08 17 15.5 15.5 17.5 17.5
2018-01-09 18 16.5 16.5 18.5 18.5
2018-01-10 19 17.5 17.5 NaN NaN
id 02 2018-01-11 20 NaN 18.5 NaN 20.5
2018-01-12 21 NaN NaN 21.5 21.5
2018-01-13 22 20.5 20.5 22.5 22.5
2018-01-14 23 21.5 21.5 23.5 23.5
2018-01-15 24 22.5 22.5 24.5 24.5
2018-01-16 25 23.5 23.5 25.5 25.5
2018-01-17 26 24.5 24.5 26.5 26.5
2018-01-18 27 25.5 25.5 27.5 27.5
2018-01-19 28 26.5 26.5 28.5 28.5
2018-01-20 29 27.5 27.5 NaN NaN
Note that columns D and E are computed for .shift(1), and columns F and G for .shift(-1). Column E is incorrect, since the first value of id 02 uses the last two values of id 01. Column F is incorrect, since the first values are NaN for both id 01 and id 02. Columns D and G give correct results. So the full answer is as follows. If the shift period is positive, use the following
df['D'] = df["C"].groupby(df['A']).shift(1).rolling(2).mean()
If the shift period is zero or negative, use the following
df['G'] = df["C"].groupby(df['A']).rolling(2).mean().shift(-1).values
Hope it helps!