python pandas mean by hour of day

I'm working with the following dataset with hourly counts in columns. The dataframe has more than 1400 columns and 100 rows.
My dataset looks like this:
CITY 2019-10-01 00:00 2019-10-01 01:00 2019-10-01 02:00 .... 2019-12-01 12:00
Wien 15 16 16 .... 14
Graz 11 11 11 .... 10
Innsbruck 12 12 10 .... 12
....
How can I convert these hourly columns to daily ones, like this:
CITY 2019-10-01 2019-10-02 2019-10-03 .... 2019-12-01
(or 1 day) (or 2 day) (or 3 day) (or 72 day)
Wien 14 15 16 .... 12
Graz 13 12 14 .... 10
Innsbruck 13 12 12 .... 12
....
I would like each day's column to contain the average over all the hours of that day.
The data type is:
type(df.columns[0])
out: str
type(df.columns[1])
out: pandas._libs.tslibs.timestamps.Timestamp
Thanks for your help!

I would do something like this:
days = df.columns[1:].to_series().dt.normalize()
df.set_index('CITY').groupby(days, axis=1).mean()
Output:
2019-10-01 2019-12-01
CITY
Wien 15.666667 14.0
Salzburg 12.000000 14.0
Graz 11.000000 10.0
Innsbruck 11.333333 12.0
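Note that grouping on columns with groupby(..., axis=1) is deprecated in recent pandas (2.1+). A sketch of an equivalent that transposes instead, reusing the same days mapping from above:
# transpose so the timestamps become the row index, group rows by day, transpose back
df.set_index('CITY').T.groupby(days).mean().T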

Related

Convert 3 columns from dataframe to date

I have a dataframe with the columns 'start_year', 'start_month', 'start_day', 'end_year', 'end_month' and 'end_day' (the sample was posted as an image).
I want to convert the 'start_year', 'start_month', 'start_day' columns to one date,
and the columns 'end_year', 'end_month', 'end_day' to another date.
Is there a way to do that?
Thank you.
Given a dataframe like this:
year month day
0 2019.0 12.0 29.0
1 2020.0 9.0 15.0
2 2018.0 3.0 1.0
You can convert them to date strings using a type cast and str.zfill:
df.apply(lambda x: f'{int(x["year"])}-{str(int(x["month"])).zfill(2)}-{str(int(x["day"])).zfill(2)}', axis=1)
OUTPUT:
0    2019-12-29
1    2020-09-15
2    2018-03-01
dtype: object
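An equivalent, slightly shorter spelling uses format specifiers instead of zfill (purely a stylistic alternative):
df.apply(lambda x: f'{int(x["year"])}-{int(x["month"]):02d}-{int(x["day"]):02d}', axis=1)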
Here's an approach:
simulate some data, since your data was posted as an image
use apply against each row to build the date columns with datetime.datetime()
import datetime as dt
import numpy as np
import pandas as pd
df = pd.DataFrame(
{
"start_year": np.random.choice(range(2018, 2022), 10),
"start_month": np.random.choice(range(1, 13), 10),
"start_day": np.random.choice(range(1, 28), 10),
"end_year": np.random.choice(range(2018, 2022), 10),
"end_month": np.random.choice(range(1, 13), 10),
"end_day": np.random.choice(range(1, 28), 10),
}
)
df = df.apply(
    lambda r: r.append(pd.Series({
        f"{startend}_date": dt.datetime(*(r[f"{startend}_{part}"]
                                          for part in ["year", "month", "day"]))
        for startend in ["start", "end"]})),
    axis=1)
df
   start_year  start_month  start_day  end_year  end_month  end_day          start_date            end_date
0        2018            9          6      2020          1        3 2018-09-06 00:00:00 2020-01-03 00:00:00
1        2018           11          6      2020          7        2 2018-11-06 00:00:00 2020-07-02 00:00:00
2        2021            8         13      2020         11        2 2021-08-13 00:00:00 2020-11-02 00:00:00
3        2021            3         15      2021          3        6 2021-03-15 00:00:00 2021-03-06 00:00:00
4        2019            4         13      2021         11        5 2019-04-13 00:00:00 2021-11-05 00:00:00
5        2021            2          5      2018          8       17 2021-02-05 00:00:00 2018-08-17 00:00:00
6        2020            4         19      2020          9       18 2020-04-19 00:00:00 2020-09-18 00:00:00
7        2020            3         27      2020         10       20 2020-03-27 00:00:00 2020-10-20 00:00:00
8        2019           12         23      2018          5       11 2019-12-23 00:00:00 2018-05-11 00:00:00
9        2021            7         18      2018          5       10 2021-07-18 00:00:00 2018-05-10 00:00:00
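Note: Series.append was removed in pandas 2.0, so the apply above will raise on recent versions. A sketch of an equivalent using pd.concat instead (same simulated df as above):
dates = df.apply(
    lambda r: pd.Series({
        f"{se}_date": dt.datetime(*(int(r[f"{se}_{part}"])
                                    for part in ["year", "month", "day"]))
        for se in ["start", "end"]}),
    axis=1)
df = pd.concat([df, dates], axis=1)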
An interesting feature of the pandas to_datetime function is that instead of
a sequence of strings you can pass it a whole DataFrame.
In that case, however, the DataFrame must have columns
named year, month and day. They can also be of float type, like your source
DataFrame sample.
So a quite elegant solution is to:
take a part of the source DataFrame (3 columns with the respective year,
month and day),
rename its columns to year, month and day,
use it as the argument to to_datetime,
save the result as a new column.
To do it, start from defining a lambda function, to be used as the rename
function below:
colNames = lambda x: x.split('_')[1]
Then just call:
df['Start'] = pd.to_datetime(df.loc[:, 'start_year' : 'start_day']
.rename(columns=colNames))
df['End'] = pd.to_datetime(df.loc[:, 'end_year' : 'end_day']
.rename(columns=colNames))
For a sample of your source DataFrame, the result is:
start_year start_month start_day evidence_method_dating end_year end_month end_day Start End
0 2019.0 12.0 9.0 Historical Observations 2019.0 12.0 9.0 2019-12-09 2019-12-09
1 2019.0 2.0 18.0 Historical Observations 2019.0 7.0 28.0 2019-02-18 2019-07-28
2 2018.0 7.0 3.0 Seismicity 2019.0 8.0 20.0 2018-07-03 2019-08-20
A possible next step would be to remove the columns holding the parts of both "start"
and "end" dates, for example as sketched below. Your choice.
Edit
To avoid saving the lambda (anonymous) function under a variable, define
this function as a regular (named) function:
def colNames(x):
    return x.split('_')[1]

Extract year from pandas datetime column as numeric value with NaN for empty cells instead of NaT

I want to extract the year from a datetime column into a new 'yyyy' column, and I want the missing values (NaT) to be displayed as NaN, so I guess the datetime dtype of the new column needs to be changed; that's where I'm stuck.
Initial df:
Date ID
0 2016-01-01 12
1 2015-01-01 96
2 NaT 20
3 2018-01-01 73
4 2017-01-01 84
5 NaT 26
6 2013-01-01 87
7 2016-01-01 64
8 2019-01-01 11
9 2014-01-01 34
Desired df:
Date ID yyyy
0 2016-01-01 12 2016
1 2015-01-01 96 2015
2 NaT 20 NaN
3 2018-01-01 73 2018
4 2017-01-01 84 2017
5 NaT 26 NaN
6 2013-01-01 87 2013
7 2016-01-01 64 2016
8 2019-01-01 11 2019
9 2014-01-01 34 2014
Code:
import pandas as pd
import numpy as np

# example df
df = pd.DataFrame({"ID": [12,96,20,73,84,26,87,64,11,34],
"Date": ['2016-01-01', '2015-01-01', np.nan, '2018-01-01', '2017-01-01', np.nan, '2013-01-01', '2016-01-01', '2019-01-01', '2014-01-01']})

df.ID = pd.to_numeric(df.ID)

df.Date = pd.to_datetime(df.Date)
print(df)
#extraction of year from date
df['yyyy'] = pd.to_datetime(df.Date).dt.strftime('%Y')

#Try to set NaT to NaN or datetime to numeric, PROBLEM: empty cells keep 'NaT'
df.loc[(df['yyyy'].isna()), 'yyyy'] = np.nan   #(try1)
df.yyyy = df.Date.astype(float)                #(try2)
df.yyyy = pd.to_numeric(df.Date)               #(try3)
print(df)
Use Series.dt.year and convert to integers with the nullable Int64 dtype:
df.Date = pd.to_datetime(df.Date)
df['yyyy'] = df.Date.dt.year.astype('Int64')
print (df)
ID Date yyyy
0 12 2016-01-01 2016
1 96 2015-01-01 2015
2 20 NaT <NA>
3 73 2018-01-01 2018
4 84 2017-01-01 2017
5 26 NaT <NA>
6 87 2013-01-01 2013
7 64 2016-01-01 2016
8 11 2019-01-01 2019
9 34 2014-01-01 2014
Without converting the floats to integers:
df['yyyy'] = df.Date.dt.year
print (df)
ID Date yyyy
0 12 2016-01-01 2016.0
1 96 2015-01-01 2015.0
2 20 NaT NaN
3 73 2018-01-01 2018.0
4 84 2017-01-01 2017.0
5 26 NaT NaN
6 87 2013-01-01 2013.0
7 64 2016-01-01 2016.0
8 11 2019-01-01 2019.0
9 34 2014-01-01 2014.0
Your solution converts NaT to the string 'NaT', so it is possible to use replace.
Btw, in recent versions of pandas the replace is not necessary; it works correctly without it.
df['yyyy'] = pd.to_datetime(df.Date).dt.strftime('%Y').replace('NaT', np.nan)
Isn't it:
df['yyyy'] = df.Date.dt.year
Output:
Date ID yyyy
0 2016-01-01 12 2016.0
1 2015-01-01 96 2015.0
2 NaT 20 NaN
3 2018-01-01 73 2018.0
4 2017-01-01 84 2017.0
5 NaT 26 NaN
6 2013-01-01 87 2013.0
7 2016-01-01 64 2016.0
8 2019-01-01 11 2019.0
9 2014-01-01 34 2014.0
For pandas 0.24.2+, you can use Int64 data type for nullable integers:
df['yyyy'] = df.Date.dt.year.astype('Int64')
which gives:
Date ID yyyy
0 2016-01-01 12 2016
1 2015-01-01 96 2015
2 NaT 20 <NA>
3 2018-01-01 73 2018
4 2017-01-01 84 2017
5 NaT 26 <NA>
6 2013-01-01 87 2013
7 2016-01-01 64 2016
8 2019-01-01 11 2019
9 2014-01-01 34 2014

How can i customize the week number in Python?

Currently, the week numbers for the period '2020-5-6' to '2020-5-19' are 20 and 21.
How do I customise it so that the week numbers are 1 and 2 instead, with subsequent periods changing accordingly?
My code:
import pandas as pd
df = pd.DataFrame({'Date':pd.date_range('2020-5-6', '2020-5-19')})
df['Period'] = df['Date'].dt.to_period('W-TUE')
df['Week_Number'] = df['Period'].dt.week
df.head()
print(df)
My output:
Date Period Week_Number
0 2020-05-06 2020-05-06/2020-05-12 20
1 2020-05-07 2020-05-06/2020-05-12 20
2 2020-05-08 2020-05-06/2020-05-12 20
3 2020-05-09 2020-05-06/2020-05-12 20
...
11 2020-05-17 2020-05-13/2020-05-19 21
12 2020-05-18 2020-05-13/2020-05-19 21
13 2020-05-19 2020-05-13/2020-05-19 21
What I want:
Date Period Week_Number
0 2020-05-06 2020-05-06/2020-05-12 1
1 2020-05-07 2020-05-06/2020-05-12 1
2 2020-05-08 2020-05-06/2020-05-12 1
3 2020-05-09 2020-05-06/2020-05-12 1
...
11 2020-05-17 2020-05-13/2020-05-19 2
12 2020-05-18 2020-05-13/2020-05-19 2
13 2020-05-19 2020-05-13/2020-05-19 2
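One way to rebase the numbering (a minimal sketch, assuming the rows are already sorted by date) is to number the distinct periods in order of appearance with pd.factorize:
# factorize assigns 0, 1, 2, ... to each distinct period in order of appearance
df['Week_Number'] = pd.factorize(df['Period'])[0] + 1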

Time difference within group by objects in Python Pandas

I have a dataframe that looks like this:
from to datetime other
-------------------------------------------------
11 1 2016-11-06 22:00:00 -
11 1 2016-11-06 20:00:00 -
11 1 2016-11-06 15:45:00 -
11 12 2016-11-06 15:00:00 -
11 1 2016-11-06 12:00:00 -
11 18 2016-11-05 10:00:00 -
11 12 2016-11-05 10:00:00 -
12 1 2016-10-05 10:00:59 -
12 3 2016-09-06 10:00:34 -
I want to group by the "from" and then "to" columns, sort the "datetime" within each group in descending order, and finally calculate the time difference between the current time and the next time within these groups. For example, in this case,
I would like to have a dataframe like the following:
from to timediff in minutes others
11 1 120
11 1 255
11 1 225
11 1 0 (preferably subtract this date from the epoch)
11 12 300
11 12 0
11 18 0
12 1 25
12 3 0
I can't get my head around figuring this out!! Is there a way out for this?
Any help will be much much appreciated!!
Thank you so much in advance!
df.assign(
    timediff=df.sort_values('datetime', ascending=False)
               .groupby(['from', 'to']).datetime
               .diff(-1).dt.seconds.div(60).fillna(0))
I think you need:
groupby with apply: sort_values followed by diff, converting each Timedelta to minutes via its seconds and floor division by 60
then fillna and sort_index, and removing level 2 of the index
df = (df.groupby(['from','to']).datetime
        .apply(lambda x: x.sort_values().diff().dt.seconds // 60)
        .fillna(0)
        .sort_index()
        .reset_index(level=2, drop=True)
        .reset_index(name='timediff in minutes'))
print (df)
from to timediff in minutes
0 11 1 120.0
1 11 1 255.0
2 11 1 225.0
3 11 1 0.0
4 11 12 300.0
5 11 12 0.0
6 11 18 0.0
7 12 1 0.0
8 12 3 0.0
df = df.join(df.groupby(['from','to'])
               .datetime
               .apply(lambda x: x.sort_values().diff().dt.seconds // 60)
               .fillna(0)
               .reset_index(level=[0,1], drop=True)
               .rename('timediff in minutes'))
print (df)
from to datetime other timediff in minutes
0 11 1 2016-11-06 22:00:00 - 120.0
1 11 1 2016-11-06 20:00:00 - 255.0
2 11 1 2016-11-06 15:45:00 - 225.0
3 11 12 2016-11-06 15:00:00 - 300.0
4 11 1 2016-11-06 12:00:00 - 0.0
5 11 18 2016-11-05 10:00:00 - 0.0
6 11 12 2016-11-05 10:00:00 - 0.0
7 12 1 2016-10-05 10:00:59 - 0.0
8 12 3 2016-09-06 10:00:34 - 0.0
Almost as above, but without apply:
result = df.sort_values(['from','to','datetime'])\
.groupby(['from','to'])['datetime']\
.diff().dt.seconds.fillna(0)
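One caveat worth adding: .dt.seconds returns only the seconds component of each Timedelta and silently drops whole days, so gaps longer than 24 hours come out wrong. A variant of the last snippet using .dt.total_seconds() avoids that:
result = df.sort_values(['from','to','datetime'])\
           .groupby(['from','to'])['datetime']\
           .diff().dt.total_seconds().div(60).fillna(0)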

Changing time components of pandas datetime64 column

I have a dataframe that can be simplified as:
date id
0 02/04/2015 02:34 1
1 06/04/2015 12:34 2
2 09/04/2015 23:03 3
3 12/04/2015 01:00 4
4 15/04/2015 07:12 5
5 21/04/2015 12:59 6
6 29/04/2015 17:33 7
7 04/05/2015 10:44 8
8 06/05/2015 11:12 9
9 10/05/2015 08:52 10
10 12/05/2015 14:19 11
11 19/05/2015 19:22 12
12 27/05/2015 22:31 13
13 01/06/2015 11:09 14
14 04/06/2015 12:57 15
15 10/06/2015 04:00 16
16 15/06/2015 03:23 17
17 19/06/2015 05:37 18
18 23/06/2015 13:41 19
19 27/06/2015 15:43 20
It can be created using:
tempDF = pd.DataFrame({ 'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"]})
The data has the following types:
tempDF.dtypes
date object
id int64
dtype: object
I have set the 'date' variable to be Pandas datetime64 format (if that's the right way to describe it) using:
import numpy as np
import pandas as pd
tempDF['date'] = pd.to_datetime(tempDF['date'])
So now, the dtypes look like:
tempDF.dtypes
date datetime64[ns]
id int64
dtype: object
I want to change the hours of the original date data. I can use .normalize() to convert to midnight via the .dt accessor:
tempDF['date'] = tempDF['date'].dt.normalize()
And, I can get access to individual datetime components (e.g. year) using:
tempDF['date'].dt.year
This produces:
0 2015
1 2015
2 2015
3 2015
4 2015
5 2015
6 2015
7 2015
8 2015
9 2015
10 2015
11 2015
12 2015
13 2015
14 2015
15 2015
16 2015
17 2015
18 2015
19 2015
Name: date, dtype: int64
The question is, how can I change specific date and time components? For example, how could I change the time to midday (12:00) for all the dates? I've found that datetime.datetime has a .replace() function. However, having converted the dates to Pandas format, it would make sense to keep them in that format. Is there a way to do that without changing the format again?
EDIT :
A vectorized way to do this would be to normalize the series, and then add 12 hours to it using timedelta. Example -
tempDF['date'].dt.normalize() + datetime.timedelta(hours=12)
Demo -
In [59]: tempDF
Out[59]:
date id
0 2015-02-04 12:00:00 1
1 2015-06-04 12:00:00 2
2 2015-09-04 12:00:00 3
3 2015-12-04 12:00:00 4
4 2015-04-15 12:00:00 5
5 2015-04-21 12:00:00 6
6 2015-04-29 12:00:00 7
7 2015-04-05 12:00:00 8
8 2015-06-05 12:00:00 9
9 2015-10-05 12:00:00 10
10 2015-12-05 12:00:00 11
11 2015-05-19 12:00:00 12
12 2015-05-27 12:00:00 13
13 2015-01-06 12:00:00 14
14 2015-04-06 12:00:00 15
15 2015-10-06 12:00:00 16
16 2015-06-15 12:00:00 17
17 2015-06-19 12:00:00 18
18 2015-06-23 12:00:00 19
19 2015-06-27 12:00:00 20
In [60]: tempDF['date'].dt.normalize() + datetime.timedelta(hours=12)
Out[60]:
0 2015-02-04 12:00:00
1 2015-06-04 12:00:00
2 2015-09-04 12:00:00
3 2015-12-04 12:00:00
4 2015-04-15 12:00:00
5 2015-04-21 12:00:00
6 2015-04-29 12:00:00
7 2015-04-05 12:00:00
8 2015-06-05 12:00:00
9 2015-10-05 12:00:00
10 2015-12-05 12:00:00
11 2015-05-19 12:00:00
12 2015-05-27 12:00:00
13 2015-01-06 12:00:00
14 2015-04-06 12:00:00
15 2015-10-06 12:00:00
16 2015-06-15 12:00:00
17 2015-06-19 12:00:00
18 2015-06-23 12:00:00
19 2015-06-27 12:00:00
dtype: datetime64[ns]
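The same offset can be written without importing datetime, using pandas' own Timedelta (purely a stylistic choice):
tempDF['date'].dt.normalize() + pd.Timedelta(hours=12)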
Timing information for both methods at bottom
One method would be to use Series.apply along with the .replace() method OP mentions in his post. Example -
tempDF['date'] = tempDF['date'].apply(lambda x:x.replace(hour=12,minute=0))
Demo -
In [12]: tempDF
Out[12]:
date id
0 2015-02-04 02:34:00 1
1 2015-06-04 12:34:00 2
2 2015-09-04 23:03:00 3
3 2015-12-04 01:00:00 4
4 2015-04-15 07:12:00 5
5 2015-04-21 12:59:00 6
6 2015-04-29 17:33:00 7
7 2015-04-05 10:44:00 8
8 2015-06-05 11:12:00 9
9 2015-10-05 08:52:00 10
10 2015-12-05 14:19:00 11
11 2015-05-19 19:22:00 12
12 2015-05-27 22:31:00 13
13 2015-01-06 11:09:00 14
14 2015-04-06 12:57:00 15
15 2015-10-06 04:00:00 16
16 2015-06-15 03:23:00 17
17 2015-06-19 05:37:00 18
18 2015-06-23 13:41:00 19
19 2015-06-27 15:43:00 20
In [13]: tempDF['date'] = tempDF['date'].apply(lambda x:x.replace(hour=12,minute=0))
In [14]: tempDF
Out[14]:
date id
0 2015-02-04 12:00:00 1
1 2015-06-04 12:00:00 2
2 2015-09-04 12:00:00 3
3 2015-12-04 12:00:00 4
4 2015-04-15 12:00:00 5
5 2015-04-21 12:00:00 6
6 2015-04-29 12:00:00 7
7 2015-04-05 12:00:00 8
8 2015-06-05 12:00:00 9
9 2015-10-05 12:00:00 10
10 2015-12-05 12:00:00 11
11 2015-05-19 12:00:00 12
12 2015-05-27 12:00:00 13
13 2015-01-06 12:00:00 14
14 2015-04-06 12:00:00 15
15 2015-10-06 12:00:00 16
16 2015-06-15 12:00:00 17
17 2015-06-19 12:00:00 18
18 2015-06-23 12:00:00 19
19 2015-06-27 12:00:00 20
Timing information
In [52]: df = pd.DataFrame([[datetime.datetime.now()] for _ in range(100000)],columns=['date'])
In [54]: %%timeit
....: df['date'].dt.normalize() + datetime.timedelta(hours=12)
....:
The slowest run took 12.53 times longer than the fastest. This could mean that an intermediate result is being cached
1 loops, best of 3: 32.3 ms per loop
In [57]: %%timeit
....: df['date'].apply(lambda x:x.replace(hour=12,minute=0))
....:
1 loops, best of 3: 1.09 s per loop
Here's the solution I used to replace the time component of the datetime values in a Pandas DataFrame. Not sure how efficient this solution is, but it fit my needs.
import pandas as pd
# Create a list of EOCY dates for a specified period
sDate = pd.Timestamp('2022-01-31 23:59:00')
eDate = pd.Timestamp('2060-01-31 23:59:00')
dtList = pd.date_range(sDate, eDate, freq='Y').to_pydatetime()
# Create a DataFrame with a single column called 'Date' and fill the rows with the list of EOCY dates.
df = pd.DataFrame({'Date': dtList})
# Loop through the DataFrame rows, using the replace function to zero out the hours and minutes of each date value.
for i in range(df.shape[0]):
    df.iloc[i, 0] = df.iloc[i, 0].replace(hour=0, minute=0)
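For this particular case a vectorized alternative would avoid the Python-level loop; since the sample times have zero seconds, zeroing the hours and minutes is the same as truncating to midnight:
df['Date'] = df['Date'].dt.normalize()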
