I have a dataframe like this:
Date Value
2022-01-01 10:00:00 7
2022-01-01 10:30:00 5
2022-01-01 11:00:00 3
....
....
2022-02-15 21:00:00 8
I would like to convert it into a day-per-row, hour-per-column format: the times become the columns, and the Value column fills the cells.
Date 10:00 10:30 11:00 11:30............21:00
2022-01-01 7 5 3 4 11
2022-01-02 8 2 4 4 13
How can I achieve this? I have tried pivot_table but without success.
Use pivot_table:
df['Date'] = pd.to_datetime(df['Date'])
out = df.pivot_table('Value', df['Date'].dt.date, df['Date'].dt.time, fill_value=0)
print(out)
# Output
Date 10:00:00 10:30:00 11:00:00 21:00:00
Date
2022-01-01 7 5 3 0
2022-02-15 0 0 0 8
To remove Date labels, you can use rename_axis:
for the top Date label: out.rename_axis(columns=None)
for the bottom Date label: out.rename_axis(index=None)
for both: out.rename_axis(index=None, columns=None)
You can replace None with any string to rename the axis instead.
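A minimal end-to-end sketch (the sample values beyond those shown in the question are made up):

import pandas as pd

# Hypothetical sample matching the question's layout
df = pd.DataFrame({
    "Date": ["2022-01-01 10:00:00", "2022-01-01 10:30:00",
             "2022-01-01 11:00:00", "2022-02-15 21:00:00"],
    "Value": [7, 5, 3, 8],
})
df["Date"] = pd.to_datetime(df["Date"])

# Days as rows, times as columns, Value as cell contents
out = df.pivot_table("Value", df["Date"].dt.date, df["Date"].dt.time, fill_value=0)

# Drop both leftover "Date" axis labels
print(out.rename_axis(index=None, columns=None))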
I have a column as below:
date
2019-05-11
2019-11-11
2020-03-01
2021-02-18
How can I create a new column that is the same format but by quarter?
Expected output
date | quarter
2019-05-11 2019-04-01
2019-11-11 2019-10-01
2020-03-01 2020-01-01
2021-02-18 2021-01-01
Thanks
You can use pandas.PeriodIndex:
df['date'] = pd.to_datetime(df['date'])
df['quarter'] = pd.PeriodIndex(df['date'].dt.to_period('Q'), freq='Q').to_timestamp()
# Output :
print(df)
date quarter
0 2019-05-11 2019-04-01
1 2019-11-11 2019-10-01
2 2020-03-01 2020-01-01
3 2021-02-18 2021-01-01
Steps:
Convert your date column to datetime if it is not already in a datetime type
Convert the dates to quarter periods with dt.to_period or with PeriodIndex
Convert the quarter periods to timestamps with to_timestamp to get the starting date of each quarter
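For what it's worth, the intermediate dt.to_period('Q') in the code above is not strictly needed; PeriodIndex can be built straight from the datetime column. A minimal sketch (sample dates taken from the question):

import pandas as pd

df = pd.DataFrame({"date": ["2019-05-11", "2019-11-11", "2020-03-01", "2021-02-18"]})
df["date"] = pd.to_datetime(df["date"])                              # step 1: ensure datetime
df["quarter"] = pd.PeriodIndex(df["date"], freq="Q").to_timestamp()  # steps 2-3
print(df)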
Source Code
import pandas as pd
df = pd.DataFrame({"Dates": pd.date_range("01-01-2022", periods=30, freq="24d")})
df["Quarters"] = df["Dates"].dt.to_period("Q").dt.to_timestamp()
print(df.sample(10))
OUTPUT
Dates Quarters
19 2023-04-02 2023-04-01
29 2023-11-28 2023-10-01
26 2023-09-17 2023-07-01
1 2022-01-25 2022-01-01
25 2023-08-24 2023-07-01
22 2023-06-13 2023-04-01
6 2022-05-25 2022-04-01
18 2023-03-09 2023-01-01
12 2022-10-16 2022-10-01
15 2022-12-27 2022-10-01
In this case, a quarter always falls in the same year and starts on day 1, so all there is to calculate is the month. Since a quarter is 3 months (12 / 4), the quarter start months are 1, 4, 7 and 10.
You can use integer division (//) to get them:
n = month
quarter_start_month = ((n - 1) // 3) * 3 + 1
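A minimal sketch of this arithmetic in pandas (the date column and sample values are taken from the earlier question):

import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2019-05-11", "2019-11-11",
                                           "2020-03-01", "2021-02-18"])})

# ((n - 1) // 3) * 3 + 1 maps months 1-3 -> 1, 4-6 -> 4, 7-9 -> 7, 10-12 -> 10
month = (df["date"].dt.month - 1) // 3 * 3 + 1
df["quarter"] = pd.to_datetime({"year": df["date"].dt.year, "month": month, "day": 1})
print(df)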
In my CSV file I have a column with date and time with the format 6/1/2019 12:00:00 AM.
My requirement is to remove the time from all rows, so each row holds only a date. After this I have to subtract the base date 1/1/2019 from every row, so each row holds only a number of days. For example, subtracting 1/1/2019 from 6/1/2019 gives the value 5.
I tried below code to remove time.
import pandas as pd
df = pd.read_csv('sample.csv', header = 0)
from datetime import datetime,date
df['date'] = pd.to_datetime(df['date']).dt.date
How can I subtract the date 1/1/2019 from each row in the column and get the number of days, using pandas and the Python datetime library?
Remove the times with Series.dt.floor (converting the datetimes to 00:00:00), subtract the base datetime, and finally convert the resulting timedeltas to days with Series.dt.days:
df = pd.read_csv('sample.csv', header = 0, parse_dates=['date'])
df['days'] = df['date'].dt.floor('d').sub(pd.Timestamp('2019-01-01')).dt.days
Sample:
df = pd.DataFrame({'date': pd.date_range('2019-01-06 12:00:00', periods=10)})
df['days'] = df['date'].dt.floor('d').sub(pd.Timestamp('2019-01-01')).dt.days
print (df)
date days
0 2019-01-06 12:00:00 5
1 2019-01-07 12:00:00 6
2 2019-01-08 12:00:00 7
3 2019-01-09 12:00:00 8
4 2019-01-10 12:00:00 9
5 2019-01-11 12:00:00 10
6 2019-01-12 12:00:00 11
7 2019-01-13 12:00:00 12
8 2019-01-14 12:00:00 13
9 2019-01-15 12:00:00 14
I have the following table:
Hora_Retiro count_uses
0 00:00:18 1
1 00:00:34 1
2 00:02:27 1
3 00:03:13 1
4 00:06:45 1
... ... ...
748700 23:58:47 1
748701 23:58:49 1
748702 23:59:11 1
748703 23:59:47 1
748704 23:59:56 1
And I want to group all values within each hour, so I can see the total number of uses per hour (00:00:00 - 23:00:00)
I have the following code:
hora_pico_aug= hora_pico.groupby(pd.Grouper(key="Hora_Retiro",freq='H')).count()
Hora_Retiro column is of timedelta64[ns] type
Which gives the following output:
count_uses
Hora_Retiro
00:00:02 2566
01:00:02 602
02:00:02 295
03:00:02 5
04:00:02 10
05:00:02 4002
06:00:02 16075
07:00:02 39410
08:00:02 76272
09:00:02 56721
10:00:02 36036
11:00:02 32011
12:00:02 33725
13:00:02 41032
14:00:02 50747
15:00:02 50338
16:00:02 42347
17:00:02 54674
18:00:02 76056
19:00:02 57958
20:00:02 34286
21:00:02 22509
22:00:02 13894
23:00:02 7134
However, the index starts at 00:00:02, and I want it to start at 00:00:00 and then proceed in one-hour intervals. Something like this:
count_uses
Hora_Retiro
00:00:00 2565
01:00:00 603
02:00:00 295
03:00:00 5
04:00:00 10
05:00:00 4002
06:00:00 16075
07:00:00 39410
08:00:00 76272
09:00:00 56721
10:00:00 36036
11:00:00 32011
12:00:00 33725
13:00:00 41032
14:00:00 50747
15:00:00 50338
16:00:00 42347
17:00:00 54674
18:00:00 76056
19:00:00 57958
20:00:00 34286
21:00:00 22509
22:00:00 13894
23:00:00 7134
How can I make it start at 00:00:00?
Thanks for the help!
You can create an hour column from the Hora_Retiro column:
df['hour'] = df['Hora_Retiro'].dt.hour
and then group by hour:
gpby_df = df.groupby('hour')['count_uses'].sum().reset_index()
gpby_df['hour'] = pd.to_datetime(gpby_df['hour'], format='%H').dt.time
gpby_df.columns = ['Hora_Retiro', 'sum_count_uses']
gpby_df
gives
Hora_Retiro sum_count_uses
0 00:00:00 14
1 09:00:00 1
2 10:00:00 2
3 20:00:00 2
I assume that the Hora_Retiro column in your DataFrame is of Timedelta type, not datetime; otherwise the date part would also be printed.
Indeed, your code creates groups starting at the minute / second taken from the first row.
To group by "full hours":
floor each element in this column to the hour (rounding would, e.g., push 23:59:56 into a nonexistent 24:00:00 bin),
then group by this floored value.
The code to do it is:
hora_pico.groupby(hora_pico.Hora_Retiro.dt.floor('H')).count_uses.count()
However, I advise you to decide what you want to aggregate:
the number of rows, or the values in the count_uses column.
In the latter case, replace count with sum.
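A minimal sketch of the difference, on a made-up frame (column names taken from the question):

import pandas as pd

hora_pico = pd.DataFrame({
    "Hora_Retiro": pd.to_timedelta(["00:00:18", "00:59:59", "01:10:00"]),
    "count_uses": [1, 2, 3],
})

hours = hora_pico.Hora_Retiro.dt.floor("H")          # full-hour bins: 00:00:00, 00:00:00, 01:00:00
print(hora_pico.groupby(hours).count_uses.count())   # rows per hour:  2, 1
print(hora_pico.groupby(hours).count_uses.sum())     # total uses:     3, 3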
I have a dataframe with a datetime64[ns] column holding data on an hourly basis:
Datum Values
2020-01-01 00:00:00 1
2020-01-01 01:00:00 10
....
2020-02-28 00:00:00 5
2020-03-01 00:00:00 4
and another table with closing days, also a datetime64[ns] column, but holding only the day part:
Dates
2020-02-28
2020-02-29
....
How can I delete all days in the first dataframe df which occur in the second dataframe Dates? So that df becomes:
2020-01-01 00:00:00 1
2020-01-01 01:00:00 10
....
2020-03-01 00:00:00 4
Use Series.dt.floor to set the times to 00:00:00, then filter with Series.isin and an inverted mask in boolean indexing:
df['Datum'] = pd.to_datetime(df['Datum'])
df1['Dates'] = pd.to_datetime(df1['Dates'])
df = df[~df['Datum'].dt.floor('d').isin(df1['Dates'])]
print (df)
Datum Values
0 2020-01-01 00:00:00 1
1 2020-01-01 01:00:00 10
3 2020-03-01 00:00:00 4
EDIT: For a flag column, convert the mask to integers with Series.view or Series.astype:
df['flag'] = df['Datum'].dt.floor('d').isin(df1['Dates']).view('i1')
#alternative
#df['flag'] = df['Datum'].dt.floor('d').isin(df1['Dates']).astype('int')
print (df)
Datum Values flag
0 2020-01-01 00:00:00 1 0
1 2020-01-01 01:00:00 10 0
2 2020-02-28 00:00:00 5 1
3 2020-03-01 00:00:00 4 0
Taking your added comment into consideration (note this approach also needs import numpy as np):
Build a regex alternation from the dates in df1:
c = "|".join(df1.Dates.values)
Coerce Datum to datetime:
df['Datum'] = pd.to_datetime(df['Datum'])
Extract Datum as a string Dates column:
df.set_index(df['Datum'], inplace=True)
df['Dates'] = df.index.date.astype(str)
Boolean mask selecting dates present in both frames:
m = df.Dates.str.contains(c)
Mark dates present in df1 as 0 and the rest as 1:
df['drop'] = np.where(m, 0, 1)
Reset the index and drop the helper column:
df.reset_index(drop=True).drop(columns=['Dates'])
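Putting the steps together, a minimal self-contained sketch (sample data taken from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Datum": ["2020-01-01 00:00:00", "2020-01-01 01:00:00",
              "2020-02-28 00:00:00", "2020-03-01 00:00:00"],
    "Values": [1, 10, 5, 4],
})
df1 = pd.DataFrame({"Dates": ["2020-02-28", "2020-02-29"]})

c = "|".join(df1.Dates.values)                 # "2020-02-28|2020-02-29"
df["Datum"] = pd.to_datetime(df["Datum"])
df["Dates"] = df["Datum"].dt.date.astype(str)
df["drop"] = np.where(df.Dates.str.contains(c), 0, 1)  # 0 = closing day, 1 = keep
print(df.drop(columns=["Dates"]))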
I have a .csv file with some data. There is only one column in this file, which contains timestamps. I need to organize the data into bins of 30 minutes. This is what my data looks like:
Timestamp
04/01/2019 11:03
05/01/2019 16:30
06/01/2019 13:19
08/01/2019 13:53
09/01/2019 13:43
So in this case, the last two data points would be grouped together in the bin that includes all the data from 13:30 to 14:00.
This is what I have already tried
df = pd.read_csv('book.csv')
df['Timestamp'] = pd.to_datetime(df.Timestamp)
df.groupby(pd.Grouper(key='Timestamp',
freq='30min')).count().dropna()
I am getting around 7000 rows showing all hours for all days with the count next to them, like this:
2019-09-01 03:00:00 0
2019-09-01 03:30:00 0
2019-09-01 04:00:00 0
...
I want to create bins for only the hours that I have in my dataset. I want to see something like this:
Time Count
11:00:00 1
13:00:00 1
13:30:00 2 (we have two data points in this interval)
16:30:00 1
Thanks in advance!
Use groupby.size:
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df = df.Timestamp.dt.floor('30min').dt.time.to_frame()\
.groupby('Timestamp').size()\
.reset_index(name='Count')
Or, as suggested by jpp (with sort_index added so the result is ordered by time rather than by count):
df = df.Timestamp.dt.floor('30min').dt.time.value_counts().sort_index().reset_index(name='Count')
print(df)
Timestamp Count
0 11:00:00 1
1 13:00:00 1
2 13:30:00 2
3 16:30:00 1