Formatting a calendar table to a datetime dataframe - python

I have calendar data in the following format:
df = pd.read_csv('2021.txt', sep=" ")
df.head()
I'd like to have it as:
Date y
2021-01-01 17:26
2021-01-02 17:27
2021-01-03 17:28
2021-01-04 17:28
...
2021-12-31 17:25
I've searched and found no similar questions. I'm trying to provide a minimal example but don't know where to start. I know I have to use pandas.to_datetime function but I don't even know how to apply it in this case because everything is separated.

Use DataFrame.melt with to_datetime and errors='coerce' to convert invalid dates like 2021-02-30 to missing values, then remove those rows with DataFrame.dropna:
df1 = df.melt('Day', var_name='Date', value_name='y')
df1['Date'] = pd.to_datetime('2021' + df1['Date'] + df1.pop('Day').astype(str),
format='%Y%b%d', errors='coerce')
df1 = df1.dropna(subset=['Date'])
print (df1)
Date y
0 2021-01-01 17:28
1 2021-01-02 17:27
2 2021-01-03 17:28
3 2021-01-04 17:28
4 2021-01-05 17:29
.. ... ...
67 2021-12-02 17:15
68 2021-12-03 17:15
69 2021-12-04 17:15
70 2021-12-05 17:15
71 2021-12-06 17:15
[72 rows x 2 columns]
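For a self-contained illustration of the steps above, here is a minimal sketch. The input layout (a Day column plus one column per abbreviated month name) is an assumption, since the original table is not shown, and the values are made up for the example.

```python
import pandas as pd

# Hypothetical calendar table: one row per day of the month, one column per month.
# The real file presumably has 31 rows and 12 month columns.
df = pd.DataFrame({
    "Day": [29, 30, 31],
    "Jan": ["17:50", "17:51", "17:52"],
    "Feb": [None, None, None],  # 2021-02-29/30/31 do not exist
    "Mar": ["18:40", "18:41", "18:42"],
})

# Wide to long: one row per (Day, month) pair.
long_df = df.melt("Day", var_name="Date", value_name="y")

# Build strings such as "2021Jan29"; impossible dates become NaT.
long_df["Date"] = pd.to_datetime(
    "2021" + long_df["Date"] + long_df.pop("Day").astype(str),
    format="%Y%b%d",
    errors="coerce",
)

# Drop the impossible dates and renumber the rows.
long_df = long_df.dropna(subset=["Date"]).reset_index(drop=True)
print(long_df)
```

The three February rows parse to NaT and are dropped, which leaves six rows with a proper datetime column.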


New column for quarter of year from datetime col

I have a column below as
date
2019-05-11
2019-11-11
2020-03-01
2021-02-18
How can I create a new column that is the same format but by quarter?
Expected output
date | quarter
2019-05-11 2019-04-01
2019-11-11 2019-10-01
2020-03-01 2020-01-01
2021-02-18 2021-01-01
Thanks
You can use pandas.PeriodIndex:
df['date'] = pd.to_datetime(df['date'])
df['quarter'] = pd.PeriodIndex(df['date'].dt.to_period('Q'), freq='Q').to_timestamp()
# Output :
print(df)
date quarter
0 2019-05-11 2019-04-01
1 2019-11-11 2019-10-01
2 2020-03-01 2020-01-01
3 2021-02-18 2021-01-01
Steps:
Convert your dates to datetime if they are not already of that type
Convert the dates to quarterly periods with dt.to_period or with PeriodIndex
Convert the quarterly periods to timestamps with to_timestamp to get the starting date of each quarter
Source Code
import pandas as pd
df = pd.DataFrame({"Dates": pd.date_range("01-01-2022", periods=30, freq="24d")})
df["Quarters"] = df["Dates"].dt.to_period("Q").dt.to_timestamp()
print(df.sample(10))
OUTPUT
Dates Quarters
19 2023-04-02 2023-04-01
29 2023-11-28 2023-10-01
26 2023-09-17 2023-07-01
1 2022-01-25 2022-01-01
25 2023-08-24 2023-07-01
22 2023-06-13 2023-04-01
6 2022-05-25 2022-04-01
18 2023-03-09 2023-01-01
12 2022-10-16 2022-10-01
15 2022-12-27 2022-10-01
In this case, a quarter always falls within a single year and starts on day 1, so all that is left to calculate is the month.
Since a quarter is 3 months long (12 / 4), the quarters start in months 1, 4, 7 and 10.
You can use integer division (//) to achieve this.
n = month
quarter = ( (n-1) // 3 ) * 3 + 1
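The integer-division formula above can be applied to a whole column at once. This is a minimal sketch, assuming a datetime column named date as in the question; start_month is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(
    ["2019-05-11", "2019-11-11", "2020-03-01", "2021-02-18"])})

# First month of the quarter: 1, 4, 7 or 10.
start_month = (df["date"].dt.month - 1) // 3 * 3 + 1

# Assemble year / month / day columns into a datetime (the day is always 1).
df["quarter"] = pd.to_datetime(pd.DataFrame({
    "year": df["date"].dt.year,
    "month": start_month,
    "day": 1,
}))
print(df)
```

For the four sample dates this gives 2019-04-01, 2019-10-01, 2020-01-01 and 2021-01-01, matching the expected output in the question.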

Pandas : Dataframe Output System Down Time

I am a beginner in Python. These readings come from sensors that report to the system at 20-minute intervals. Now, I would like to find the total downtime from each start time until the corresponding end time, when the system recovered.
Original Data:
date, Quality Sensor Reading
1/1/2022 9:00 0
1/1/2022 9:20 0
1/1/2022 9:40 0
1/1/2022 10:00 0
1/1/2022 10:20 0
1/1/2022 10:40 0
1/1/2022 12:40 0
1/1/2022 13:00 0
1/1/2022 13:20 0
1/3/2022 1:20 0
1/3/2022 1:40 0
1/3/2022 2:00 0
1/4/2022 14:40 0
1/4/2022 15:00 0
1/4/2022 15:20 0
1/4/2022 17:20 0
1/4/2022 17:40 0
1/4/2022 18:00 0
1/4/2022 18:20 0
1/4/2022 18:40 0
The expected output is as below:
Quality Sensor = 0
Start_Time End_Time Total_Down_Time
2022-01-01 09:00:00 2022-01-01 10:40:00 100 minutes
2022-01-01 12:40:00 2022-01-01 13:20:00 40 minutes
2022-01-03 01:20:00 2022-01-03 02:00:00 40 minutes
2022-01-04 14:40:00 2022-01-04 15:20:00 40 minutes
2022-01-04 17:20:00 2022-01-04 18:40:00 80 minutes
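First, let's break them into groups (this assumes df.date has already been converted with pd.to_datetime):
# Mark every row that follows a gap longer than 20 minutes
df.loc[df.date.diff().gt(pd.Timedelta(minutes=20)), 'group'] = 1
# Number the groups; rows before the first gap get group 0
df.group = df.group.cumsum().ffill().fillna(0)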
Then, we can extract what we want from each group, and rename:
df2 = df.groupby('group')['date'].agg(['min', 'max']).reset_index(drop=True)
df2.columns = ['start_time', 'end_time']
Finally, we'll add the interval column and format it to minutes:
df2['down_time'] = df2.end_time.sub(df2.start_time)
# Optional, I wouldn't do this here:
# total_seconds() also counts whole days, which .dt.seconds would drop
df2.down_time = df2.down_time.dt.total_seconds() / 60
Output:
start_time end_time down_time
0 2022-01-01 09:00:00 2022-01-01 10:40:00 100.0
1 2022-01-01 12:40:00 2022-01-01 13:20:00 40.0
2 2022-01-03 01:20:00 2022-01-03 02:00:00 40.0
3 2022-01-04 14:40:00 2022-01-04 15:20:00 40.0
4 2022-01-04 17:20:00 2022-01-04 18:40:00 80.0
Let's say the dates are listed in the dataframe df under column date. You can use shift() to create a second column with the subsequent date/time, then create a third that has your duration by subtracting them. Something like:
df['date2'] = df['date'].shift(-1)
df['difference'] = df['date2'] - df['date']
You'll obviously have one row at the end that doesn't have a following value, and therefore doesn't have a difference.
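To go from row-to-row differences to the outage table, one option is to flag every gap longer than the 20-minute reporting interval and label the runs between gaps. This is a minimal sketch, assuming date is already a datetime column and using a shortened version of the question's data; block, start_time, end_time and down_time_min are illustrative names. It compares each row with the previous one (shift(1)) rather than the next one.

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime([
    "2022-01-01 09:00", "2022-01-01 09:20", "2022-01-01 09:40",
    "2022-01-01 12:40", "2022-01-01 13:00", "2022-01-01 13:20",
])})

# Time elapsed since the previous reading (NaT for the first row).
gap = df["date"] - df["date"].shift(1)

# A gap above 20 minutes starts a new block; cumsum numbers the blocks.
df["block"] = (gap > pd.Timedelta(minutes=20)).cumsum()

# First and last reading of each block, then the duration in minutes.
out = df.groupby("block")["date"].agg(start_time="min", end_time="max")
out["down_time_min"] = (out["end_time"] - out["start_time"]).dt.total_seconds() / 60
print(out)
```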

Resample pandas dataframe by two columns

I have a Pandas dataframe that describes arrivals at stations. It has two columns: time and station id.
Example:
time id
0 2019-10-31 23:59:36 22
1 2019-10-31 23:58:23 260
2 2019-10-31 23:54:55 82
3 2019-10-31 23:54:46 82
4 2019-10-31 23:54:42 21
I would like to resample this into five minute blocks, which shows the number of arrivals at the station in the time-block that starts at the time, so it should look like this:
time id arrivals
0 2019-10-31 23:55:00 22 1
1 2019-10-31 23:50:00 22 5
2 2019-10-31 23:55:00 82 0
3 2019-10-31 23:25:00 82 325
4 2019-10-31 23:21:00 21 1
How could I use some high performance function to achieve this?
pandas.DataFrame.resample does not seem to be a possibility, since it requires the index to be a timestamp, and in this case several rows can have the same time.
df.groupby(['id',pd.Grouper(key='time', freq='5min')])\
.size()\
.to_frame('arrivals')\
.reset_index()
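Here is the same groupby with pd.Grouper as a self-contained sketch, run on the five sample rows from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "time": pd.to_datetime([
        "2019-10-31 23:59:36",
        "2019-10-31 23:58:23",
        "2019-10-31 23:54:55",
        "2019-10-31 23:54:46",
        "2019-10-31 23:54:42",
    ]),
    "id": [22, 260, 82, 82, 21],
})

# Group by station and by 5-minute bin of the time column, then count rows.
arrivals = (
    df.groupby(["id", pd.Grouper(key="time", freq="5min")])
      .size()
      .to_frame("arrivals")
      .reset_index()
)
print(arrivals)
```

Because the time column is passed to pd.Grouper through key, it does not need to be the index, and duplicate timestamps are not a problem.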
I think it's a horrible solution (couldn't find a better one at the moment), but it more or less gets you where you want:
df.groupby("id").resample("5min", on="time").count()[["id"]].swaplevel(0, 1, axis=0).sort_index(axis=0).set_axis(["arrivals"], axis=1)
Try with groupby and resample:
>>> df.set_index("time").groupby("id").resample("5min").count()
id
id time
21 2019-10-31 23:50:00 1
22 2019-10-31 23:55:00 1
82 2019-10-31 23:50:00 2
260 2019-10-31 23:55:00 1

split datetime column into date and time columns in pandas

I have the following question. I have a date_time column in my dataframe (and many other columns).
df["Date_time"].head()
0 2021-05-15 09:54
1 2021-05-27 17:04
2 2021-05-27 00:00
3 2021-05-27 09:36
4 2021-05-26 18:39
Name: Date_time, dtype: object
I would like to split this column into two (date and time).
I use this function, which works fine:
df["Date"] = ""
df["Time"] = ""
def split_date_time(data_frame):
    for i in range(0, len(data_frame)):
        df["Date"][i] = df["Date_time"][i].split()[0]
        df["Time"][i] = df["Date_time"][i].split()[1]
split_date_time(df)
But is there a more elegant way? Thanks
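The dt accessor can give you the date and time separately. Since the column's dtype is object, convert it with pd.to_datetime first:
df["Date_time"] = pd.to_datetime(df["Date_time"])
df["Date"] = df["Date_time"].dt.date
df["Time"] = df["Date_time"].dt.time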
to get
>>> df
Date_time Date Time
0 2021-05-15 09:54:00 2021-05-15 09:54:00
1 2021-05-27 17:04:00 2021-05-27 17:04:00
2 2021-05-27 00:00:00 2021-05-27 00:00:00
3 2021-05-27 09:36:00 2021-05-27 09:36:00
4 2021-05-26 18:39:00 2021-05-26 18:39:00
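If the goal is to keep plain strings rather than date and time objects, a vectorized string split is an alternative to the loop in the question. This is a minimal sketch, assuming every value has the form "YYYY-MM-DD HH:MM" with a single space between the two parts:

```python
import pandas as pd

df = pd.DataFrame({
    "Date_time": ["2021-05-15 09:54", "2021-05-27 17:04", "2021-05-27 00:00"],
})

# Split once on whitespace; expand=True returns one column per piece.
df[["Date", "Time"]] = df["Date_time"].str.split(n=1, expand=True)
print(df)
```

Both new columns stay as strings here, so use the dt accessor approach instead if you need to do date arithmetic afterwards.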

Pandas how to copy a column to another dataframe with similar index

I have one Pandas dataframe like below. I used pd.to_datetime(df['date']).dt.normalize() to get the date2 column to show just the date and ignore the time. Wasn't sure how to have it be just YYYY-MM-DD format.
date2 count compound_mean
0 2021-01-01 00:00:00+00:00 18 0.188411
1 2021-01-02 00:00:00+00:00 9 0.470400
2 2021-01-03 00:00:00+00:00 10 0.008190
3 2021-01-04 00:00:00+00:00 58 0.187510
4 2021-01-05 00:00:00+00:00 150 0.176173
Another dataframe with the following format.
Date Average
2021-01-04 18.200001
2021-01-05 22.080000
2021-01-06 22.250000
2021-01-07 22.260000
2021-01-08 21.629999
I want to have the Average column show up in the first dataframe by matching the dates and then forward-filling any blank values. From 01-01 to 01-03 there will be nothing to forward fill, so I guess it will end up being zero. I'm having trouble finding the right Pandas functions to do this, looking for some guidance. Thank you.
I assume your first dataframe to be df1 and second dataframe to be df2.
First, create a Date column in df1 from date2 so that it matches the Date column of df2.
df1['Date'] = pd.to_datetime(df1['date2']).dt.date
You can then remove the date2 column of df1:
df1.drop("date2",inplace=True, axis=1)
You also need to change the type of df2's Date column so that it matches the type of df1's Date column:
df2['Date'] = pd.to_datetime(df2['Date']).dt.date
Then make a new dataframe that contains the columns of both dataframes, matched on the Date column.
main_df = pd.merge(df1,df2,on="Date", how="left")
df1['Average'] = main_df['Average']
df1 = pd.DataFrame(df1, columns = ['Date', 'count','compound_mean','Average'])
You can then forward-fill the null values with ffill and set the first 3 null values, which have nothing before them, to 0:
df1.ffill(inplace=True)
df1.fillna(0, inplace=True)
Your first dataframe will then look the way you wanted.
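Put together, the steps above look roughly like this. It is a minimal sketch using the first rows of both tables from the question; merged is an illustrative name. Converting both key columns to plain dates before merging is what makes the keys comparable.

```python
import pandas as pd

df1 = pd.DataFrame({
    "date2": pd.to_datetime(
        ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04", "2021-01-05"],
        utc=True,
    ),
    "count": [18, 9, 10, 58, 150],
    "compound_mean": [0.188411, 0.470400, 0.008190, 0.187510, 0.176173],
})
df2 = pd.DataFrame({
    "Date": ["2021-01-04", "2021-01-05", "2021-01-06"],
    "Average": [18.200001, 22.080000, 22.250000],
})

# Reduce both keys to plain datetime.date objects so they compare equal.
df1["Date"] = df1.pop("date2").dt.date
df2["Date"] = pd.to_datetime(df2["Date"]).dt.date

# Left merge keeps every row of df1; unmatched dates get NaN in Average.
merged = df1.merge(df2, on="Date", how="left")

# Forward-fill gaps, then use 0 where there is nothing earlier to carry forward.
merged["Average"] = merged["Average"].ffill().fillna(0)
merged = merged[["Date", "count", "compound_mean", "Average"]]
print(merged)
```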
Try the following:
>>> df.index = pd.to_datetime(df.date2).dt.date
# If df.date2 is already datetime, use ^ df.index = df.date2.dt.date
>>> df2['Date'] = pd.to_datetime(df2['Date'])
# If df2['Date'] is already datetime, ^ this above line is not needed
>>> df.join(df2.set_index('Date')).fillna(0)
date2 count compound_mean Average
date2
2021-01-01 2021-01-01 00:00:00+00:00 18 0.188411 0.000000
2021-01-02 2021-01-02 00:00:00+00:00 9 0.470400 0.000000
2021-01-03 2021-01-03 00:00:00+00:00 10 0.008190 0.000000
2021-01-04 2021-01-04 00:00:00+00:00 58 0.187510 18.200001
2021-01-05 2021-01-05 00:00:00+00:00 150 0.176173 22.080000
You can perform merge operation as follows:
#Making date of same UTC format from both tables
df1['date2'] = pd.to_datetime(df1['date2'],utc = True)
df2['Date'] = pd.to_datetime(df2['Date'],utc = True)
#Renaming df1 column so that we can map 'Date' from both dataframes
df1.rename(columns={'date2': 'Date'},inplace=True)
#Merge operation
res = pd.merge(df1,df2,on='Date',how='left').fillna(0)
Output:
Date count compound_mean Average
0 2021-01-01 00:00:00+00:00 18 0.188411 0.000000
1 2021-01-02 00:00:00+00:00 9 0.470400 0.000000
2 2021-01-03 00:00:00+00:00 10 0.008190 0.000000
3 2021-01-04 00:00:00+00:00 58 0.187510 18.200001
4 2021-01-05 00:00:00+00:00 150 0.176173 22.080000
