Remove all rows between two timestamps from pandas dataframe - python

How can I delete all rows from a dataframe between two timestamps, inclusive?
My dataframe looks like this:
b a
0 2016-12-02 22:00:00 19.218519
1 2016-12-02 23:00:00 19.171197
2 2016-12-03 00:00:00 19.257836
3 2016-12-03 01:00:00 19.195610
4 2016-12-03 02:00:00 19.176413
For example, I want to delete all rows from the above dataframe whose timestamp falls between "2016-12-02 22:00:00" and "2016-12-03 00:00:00", inclusive.
So the result will contain only rows 3 and 4.
The type of column b is datetime64 and the type of a is float.
Please suggest.

You can filter those out:
from_ts = '2016-12-02 22:00:00'
to_ts = '2016-12-03 00:00:00'
df = df[(df['b'] < from_ts) | (df['b'] > to_ts)]

Convert column b to datetime and then apply a mask:
df.b = pd.to_datetime(df.b, format = '%Y-%m-%d %H:%M:%S')
df[(df.b < '2016-12-02 22:00:00') | (df.b > '2016-12-03 00:00:00')]
b a
3 2016-12-03 01:00:00 19.195610
4 2016-12-03 02:00:00 19.176413

Alternatively, collect the index labels of the rows inside the window and drop them by label:
index_list = df.b[(df.b >= "2016-12-02 22:00:00") & (df.b <= "2016-12-03 00:00:00")].index.tolist()
df.drop(index_list, inplace=True)
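The same inclusive filter can also be written with Series.between and a negated mask. This is only a minimal sketch that rebuilds the sample data from the question; the names in_window and result are made up for illustration.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "b": pd.to_datetime([
        "2016-12-02 22:00:00", "2016-12-02 23:00:00", "2016-12-03 00:00:00",
        "2016-12-03 01:00:00", "2016-12-03 02:00:00",
    ]),
    "a": [19.218519, 19.171197, 19.257836, 19.195610, 19.176413],
})

from_ts = pd.Timestamp("2016-12-02 22:00:00")
to_ts = pd.Timestamp("2016-12-03 00:00:00")

# between() includes both endpoints by default
in_window = df["b"].between(from_ts, to_ts)

# Keep everything that is NOT inside the window
result = df[~in_window]
print(result)
```

Negating the mask leaves the original dataframe untouched, which is usually easier to reason about than dropping rows in place.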

Related

Convert pandas dataframe hourly values in column names (H1, H2,... ) to a series in a separate column

I am trying to convert a dataframe in which hourly data appears in distinct columns, like here:
... to a dataframe that only contains two columns ['datetime', 'value'].
For example:
Datetime
value
2020-01-01 01:00:00
0
2020-01-01 02:00:00
0
...
...
2020-01-01 09:00:00
106
2020-01-01 10:00:00
2852
Any solution without using a for-loop?
Use DataFrame.melt, convert the Date values to datetimes, and add the hours with to_timedelta after stripping the H prefix from the column names:
df = df.melt('Date')
td = pd.to_timedelta(df.pop('variable').str.strip('H').astype(int), unit='h')
df['Date'] = pd.to_datetime(df['Date']) + td
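Since the original table is not shown, here is a small self-contained sketch of the melt approach. The wide frame, its column names (Date, H1, H2) and the variable names wide and tidy are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical wide data: one row per day, one column per hour
wide = pd.DataFrame({
    "Date": ["2020-01-01", "2020-01-02"],
    "H1": [0, 5],
    "H2": [10, 15],
})

# One row per (Date, hour column) pair
tidy = wide.melt(id_vars="Date", var_name="hour", value_name="value")

# "H1" -> 1 -> Timedelta of 1 hour
offset = pd.to_timedelta(tidy.pop("hour").str.lstrip("H").astype(int), unit="h")
tidy["Datetime"] = pd.to_datetime(tidy["Date"]) + offset

# Keep only the two requested columns, in chronological order
tidy = (
    tidy.drop(columns="Date")
        .sort_values("Datetime")
        .reset_index(drop=True)[["Datetime", "value"]]
)
print(tidy)
```

melt stacks the hour columns one after another, so the sort_values call is what puts the rows back into chronological order.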
You can also do it by applying several functions to the DataFrame:
import pandas as pd
from datetime import datetime
# Example DataFrame
df = pd.DataFrame({'date': ['1/1/2020', '1/2/2020', '1/3/2020'],
'h1': [0, 222, 333],
'h2': [44, 0, 0],
"h3": [1, 2, 3]})
# To keep it simple I used only hours 1 to 3; for a full day of 24 hourly columns, change HOURS_COUNT to 25
HOURS_COUNT = 4
df["hours"] = df.apply(lambda row: [h for h in range(1, HOURS_COUNT)], axis=1)
df["hour_values"] = df.apply(lambda row: {h: row[f"h{h}"] for h in range(1, HOURS_COUNT)}, axis=1)
df = df.explode("hours")
df["value"] = df.apply(lambda row: row["hour_values"][row["hours"]], axis=1)
df["date_full"] = df.apply(lambda row: datetime.strptime(f"{row['date']} {row['hours']}", "%m/%d/%Y %H"), axis=1)
df = df[["date_full", "value"]]
df = df.loc[df["value"] > 0]
So the initial DataFrame is:
date h1 h2 h3
0 1/1/2020 0 44 1
1 1/2/2020 222 0 2
2 1/3/2020 333 0 3
And the resulting DataFrame is:
date_full value
0 2020-01-01 02:00:00 44
0 2020-01-01 03:00:00 1
1 2020-01-02 01:00:00 222
1 2020-01-02 03:00:00 2
2 2020-01-03 01:00:00 333
2 2020-01-03 03:00:00 3

How to homogenize date type in a pandas dataframe column?

I have a Date column in my dataframe having dates with 2 different types (YYYY-DD-MM 00:00:00 and YYYY-DD-MM) :
Date
0 2023-01-10 00:00:00
1 2024-27-06
2 2022-07-04 00:00:00
3 NaN
4 2020-30-06
(you can use pd.read_clipboard(sep=r'\s\s+') after copying the previous dataframe to get it in your notebook)
I would like to have only a YYYY-MM-DD type. Consequently, I would like to have :
Date
0 2023-10-01
1 2024-06-27
2 2022-04-07
3 NaN
4 2020-06-30
How could I do this, please?
Use Series.str.replace with to_datetime and the format parameter:
df['Date'] = pd.to_datetime(df['Date'].str.replace(' 00:00:00',''), format='%Y-%d-%m')
print (df)
Date
0 2023-10-01
1 2024-06-27
2 2022-04-07
3 NaT
4 2020-06-30
Another idea is to parse both formats and combine the results:
d1 = pd.to_datetime(df['Date'], format='%Y-%d-%m', errors='coerce')
d2 = pd.to_datetime(df['Date'], format='%Y-%d-%m 00:00:00', errors='coerce')
df['Date'] = d1.fillna(d2)
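If the goal is literally YYYY-MM-DD text with the missing value kept as NaN (rather than NaT), one possible follow-up is to format the parsed dates back to strings. A minimal sketch on the sample data from the question; the names stripped and parsed are made up here.

```python
import numpy as np
import pandas as pd

# Sample data from the question (year-day-month order, optional midnight time)
df = pd.DataFrame({
    "Date": ["2023-01-10 00:00:00", "2024-27-06", "2022-07-04 00:00:00",
             np.nan, "2020-30-06"],
})

# Drop the literal midnight suffix, then parse as year-day-month
stripped = df["Date"].str.replace(" 00:00:00", "", regex=False)
parsed = pd.to_datetime(stripped, format="%Y-%d-%m")

# Format back to text; the missing value comes out as NaN again
df["Date"] = parsed.dt.strftime("%Y-%m-%d")
print(df)
```

After this step the column holds strings, not datetimes, so keep the parsed series around if you still need date arithmetic.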

How can I print certain rows from a CSV in pandas

My problem is that I have a big dataframe with over 40000 rows, and now I want to select the rows from 2013-01-01 00:00:00 until 2013-31-12 00:00:00.
print(df.loc[df['localhour'] == '2013-01-01 00:00:00'])
That's my code so far, but I can't work out how to select an interval to print. Any ideas?
One way is to set your index as datetime, sort it, and then use pd.DataFrame.loc with string indexers (recent pandas versions require a sorted index for this kind of slice when an endpoint is missing from the data):
df = pd.DataFrame({'Date': ['2013-01-01', '2014-03-01', '2011-10-01', '2013-05-01'],
'Var': [1, 2, 3, 4]})
df['Date'] = pd.to_datetime(df['Date'])
res = df.set_index('Date').sort_index().loc['2010-01-01':'2013-01-01']
print(res)
Var
Date
2011-10-01 3
2013-01-01 1
Make a datetime object and then apply the condition:
print(df)
date
0 2013-01-01
1 2014-03-01
2 2011-10-01
3 2013-05-01
df['date']=pd.to_datetime(df['date'])
df['date'].loc[(df['date']<='2013-12-31 00:00:00') & (df['date']>='2013-01-01 00:00:00')]
Output:
0 2013-01-01
3 2013-05-01
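For a whole calendar year, a half-open interval (start included, end excluded) avoids losing rows that fall later in the day on 31 December. A minimal sketch with made-up rows; the localhour column name comes from the question, everything else is an assumption for illustration.

```python
import pandas as pd

# Hypothetical rows; only the column name "localhour" is from the question
df = pd.DataFrame({
    "localhour": pd.to_datetime([
        "2013-01-01 00:00:00", "2014-03-01 05:00:00", "2011-10-01 12:00:00",
        "2013-12-31 00:00:00", "2013-12-31 18:00:00",
    ]),
    "Var": [1, 2, 3, 4, 5],
})

start = pd.Timestamp("2013-01-01")
end = pd.Timestamp("2014-01-01")  # exclusive upper bound

# Start included, end excluded: covers all of 2013
mask = (df["localhour"] >= start) & (df["localhour"] < end)
selected = df.loc[mask]
print(selected)
```

If you really want to stop at 2013-12-31 00:00:00 exactly, use that timestamp as end and change the comparison to <=.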

pandas: selecting rows in a specific time window

I have a dataset of samples covering multiple days, all with a timestamp.
I want to select rows within a specific time window. E.g. all rows that were generated between 1pm and 3 pm every day.
This is a sample of my data in a pandas dataframe:
22 22 2018-04-12T20:14:23Z 2018-04-12T21:14:23Z 0 6370.1
23 23 2018-04-12T21:14:23Z 2018-04-12T21:14:23Z 0 6368.8
24 24 2018-04-12T22:14:22Z 2018-04-13T01:14:23Z 0 6367.4
25 25 2018-04-12T23:14:22Z 2018-04-13T01:14:23Z 0 6365.8
26 26 2018-04-13T00:14:22Z 2018-04-13T01:14:23Z 0 6364.4
27 27 2018-04-13T01:14:22Z 2018-04-13T01:14:23Z 0 6362.7
28 28 2018-04-13T02:14:22Z 2018-04-13T05:14:22Z 0 6361.0
29 29 2018-04-13T03:14:22Z 2018-04-13T05:14:22Z 0 6359.3
.. ... ... ... ... ...
562 562 2018-05-05T08:13:21Z 2018-05-05T09:13:21Z 0 6300.9
563 563 2018-05-05T09:13:21Z 2018-05-05T09:13:21Z 0 6300.7
564 564 2018-05-05T10:13:14Z 2018-05-05T13:13:14Z 0 6300.2
565 565 2018-05-05T11:13:14Z 2018-05-05T13:13:14Z 0 6299.9
566 566 2018-05-05T12:13:14Z 2018-05-05T13:13:14Z 0 6299.6
How do I achieve that? I need to ignore the date and just evaluate the time component. I could traverse the dataframe in a loop and evaluate the datetime that way, but there must be a simpler way to do that.
I converted the messageDate column, which was read as a string, to a datetime with
df["messageDate"]=pd.to_datetime(df["messageDate"])
But after that I got stuck on how to filter on time only.
Any input appreciated.
Datetime columns have a DatetimeProperties object (the .dt accessor), from which you can extract the hour and filter on it:
import pandas as pd
df = pd.DataFrame(
[
'2018-04-12T12:00:00Z', '2018-04-12T14:00:00Z','2018-04-12T20:00:00Z',
'2018-04-13T12:00:00Z', '2018-04-13T14:00:00Z', '2018-04-13T20:00:00Z'
],
columns=['messageDate']
)
df
# messageDate
# 0 2018-04-12 12:00:00
# 1 2018-04-12 14:00:00
# 2 2018-04-12 20:00:00
# 3 2018-04-13 12:00:00
# 4 2018-04-13 14:00:00
# 5 2018-04-13 20:00:00
df["messageDate"] = pd.to_datetime(df["messageDate"])
time_mask = (df['messageDate'].dt.hour >= 13) & \
(df['messageDate'].dt.hour <= 15)
df[time_mask]
# messageDate
# 1 2018-04-12 14:00:00
# 4 2018-04-13 14:00:00
I hope the code is self-explanatory. You can always ask questions.
import pandas as pd
# Prepping data for example
dates = pd.date_range('1/1/2018', periods=7, freq='h')
data = {'A' : range(7)}
df = pd.DataFrame(index = dates, data = data)
print(df)
# A
# 2018-01-01 00:00:00 0
# 2018-01-01 01:00:00 1
# 2018-01-01 02:00:00 2
# 2018-01-01 03:00:00 3
# 2018-01-01 04:00:00 4
# 2018-01-01 05:00:00 5
# 2018-01-01 06:00:00 6
# Creating a mask to filter the values we wish to keep.
# Here, we use df.index because the index is our datetime.
# If the datetime is a column, you can always say df['column_name']
mask = (df.index > '2018-1-1 01:00:00') & (df.index < '2018-1-1 05:00:00')
print(mask)
# [False False True True True False False]
df_with_good_dates = df.loc[mask]
print(df_with_good_dates)
# A
# 2018-01-01 02:00:00 2
# 2018-01-01 03:00:00 3
# 2018-01-01 04:00:00 4
hours = df["messageDate"].apply(lambda x: x.hour)
df = df[(hours >= 13) & (hours < 15)]
This keeps timestamps from 13:00:00 up to 14:59:59. You can use x.minute and x.second similarly.
Try this after ensuring messageDate is indeed in datetime format, as you have done:
df.set_index('messageDate', inplace=True)
choseInd = [ind for ind in df.index if 13 <= ind.hour <= 15]
df_select = df.loc[choseInd]
You can do the same without making the datetime column the index, as the answer with apply and lambda shows.
Having the datetime as the index, rather than a numerical one, just makes your dataframe 'better looking'.
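Once the datetime is the index, DataFrame.between_time can select a daily window directly and ignores the date part. A minimal sketch with made-up rows; the messageDate name comes from the question, while the value column and the variable window are assumptions for illustration.

```python
import pandas as pd

# Hypothetical samples spread over two days
df = pd.DataFrame({
    "messageDate": pd.to_datetime([
        "2018-04-12 12:00:00", "2018-04-12 13:30:00", "2018-04-12 15:00:00",
        "2018-04-12 15:01:00", "2018-04-13 14:14:22", "2018-04-13 20:00:00",
    ]),
    "value": [1, 2, 3, 4, 5, 6],
})

# between_time needs a DatetimeIndex; both endpoints are included by default
window = df.set_index("messageDate").between_time("13:00", "15:00")
print(window)
```

Unlike the hour-based masks, this treats 15:00:00 as the last accepted instant, so a row at 15:01 is left out.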

Pandas column date transformation

I have a pandas dataframe with a date column whose data type is datetime64[ns]. There are over 1000 observations in the dataframe. I want to transform the following column:
date
2013-05-01
2013-05-01
to
date
05/2013
05/2013
or
date
05-2013
05-2013
EDIT//
this is my sample code as of now
test = pd.DataFrame({'a':['07/2017','07/2017',pd.NaT]})
a
0 2017-07-13
1 2017-07-13
2 NaT
test['a'].apply(lambda x: x if pd.isnull(x) == True else x.strftime('%Y-%m'))
0 2017-07-01
1 2017-07-01
2 NaT
Name: a, dtype: datetime64[ns]
Why did only the date change and not the format?
You can convert datetime64 into whatever string format you like using the strftime method. In your case you would apply it like this:
df.date = df.date[df.date.notnull()].map(lambda x: x.strftime('%m/%Y'))
df.date
Out[111]:
0 05/2013
1 05/2013
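The same formatting can be done without a lambda through the .dt accessor, which also passes missing values through. A minimal sketch, assuming a datetime64 column named date as in the question; the month_year column name is made up here.

```python
import pandas as pd

# Two dates from the question plus one missing value
df = pd.DataFrame({"date": pd.to_datetime(["2013-05-01", "2013-05-01", None])})

# Vectorised strftime: the result is a column of strings, NaT becomes NaN
df["month_year"] = df["date"].dt.strftime("%m/%Y")
print(df)
```

Writing to a new column keeps the original datetime64 values available, since the formatted column contains plain strings and can no longer be used for date arithmetic.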
