Pandas dataframe drop multiple rows based on datetime difference - python

I store datetimes in a pandas dataframe which look like dd/mm/yyyy hh:mm:ss
I want to drop all rows where values in column x (datetime) are within 24 hours of one another.
On a row-by-row basis I was previously doing this, but it doesn't seem to work inside the drop call:
df.drop(df[(df['d2'] - df['d1']).seconds / 3600 < 24].index)
>> AttributeError: 'Series' object has no attribute 'seconds'

This should work (it assumes import datetime and that d1 and d2 are already datetime columns):
df.loc[ (df.d2 - df.d1) >= datetime.timedelta(days=1) ]
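A minimal, self-contained sketch of both forms, assuming d1 and d2 are already datetime columns (the example data below is made up):
import datetime
import pandas as pd

# hypothetical frame with two datetime columns
df = pd.DataFrame({
    "d1": pd.to_datetime(["01/01/2020 10:00:00", "02/01/2020 10:00:00"], dayfirst=True),
    "d2": pd.to_datetime(["01/01/2020 18:00:00", "04/01/2020 10:00:00"], dayfirst=True),
})

# keep rows where d2 is at least 24 hours after d1
kept = df.loc[(df["d2"] - df["d1"]) >= datetime.timedelta(days=1)]

# equivalent drop: remove rows where the gap is under 24 hours
dropped = df.drop(df[(df["d2"] - df["d1"]) < datetime.timedelta(days=1)].index)
Both forms produce the same single-row result here.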

The answer is straightforward:
import pandas as pd
df = pd.read_csv("test.csv")
df["d1"] = pd.to_datetime(df["d1"])
df["d2"] = pd.to_datetime(df["d2"])
Now if you subtract one column from the other,
df["d2"] - df["d1"]
the result is a timedelta64 series (displayed in days and time), so you can compare it with pd.Timedelta, as @kaan suggested:
df.loc[(df["d2"] - df["d1"]) >= pd.Timedelta(days=1)]

Related

How to create a "duration" column from two "dates" columns?

I have two columns in my "expeditions" dataframe: a start date ("basecamp_date") and an end date ("highpoint_date"). I would like to create a new column that expresses the duration between these two dates, but I have no idea how to do it.
import pandas as pd
expeditions = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/expeditions.csv")
In read_csv, convert the columns to datetimes with parse_dates, then subtract the columns and use Series.dt.days to get the difference in days:
file = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/expeditions.csv"
expeditions = pd.read_csv(file, parse_dates=['basecamp_date','highpoint_date'])
expeditions['diff'] = expeditions['highpoint_date'].sub(expeditions['basecamp_date']).dt.days
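A runnable sketch with a small inline CSV standing in for the expeditions file (same column names, made-up dates):
import io
import pandas as pd

csv = io.StringIO(
    "basecamp_date,highpoint_date\n"
    "2020-04-01,2020-04-20\n"
    "2020-05-03,2020-05-10\n"
)
expeditions = pd.read_csv(csv, parse_dates=['basecamp_date', 'highpoint_date'])
expeditions['diff'] = expeditions['highpoint_date'].sub(expeditions['basecamp_date']).dt.days
print(expeditions['diff'].tolist())   # [19, 7] -> whole days as integers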
You can convert those columns to datetime and then subtract them to get the duration:
tstart = pd.to_datetime(expeditions['basecamp_date'])
tend = pd.to_datetime(expeditions['highpoint_date'])
expeditions['duration'] = tend - tstart

How to join pandas Series of numbers to make it one number

I'm using Pandas library.
I have three columns in my dataset named 'hours', 'minutes' and 'seconds'.
I want to join the three columns to make a time format.
For example, the first row should read as 9:33:09.
How can I do that?
Convert each column to a timedelta and add them:
pd.to_timedelta(df["hours"], unit='h') + pd.to_timedelta(df["minutes"], unit='m') + pd.to_timedelta(df["sec"], unit='s')
Viewing your example, I think the sec column is actually microseconds; if that's the case, use:
pd.to_timedelta(df["hours"], unit='h') + pd.to_timedelta(df["minutes"], unit='m') + pd.to_timedelta(df["sec"], unit='us')
You can use string operations and pandas for this.
import pandas as pd

# Read csv
data = pd.read_csv("data.csv")
# Create a DataFrame object with the three relevant columns
df = pd.DataFrame(data, columns=["hour", "mins", "sec"])
# Iterate through records and print the values.
for ind in df.index:
    hour = str(df['hour'][ind])
    mins = str(df['mins'][ind])
    sec = str(df['sec'][ind])
    sec = sec[:len(sec) - 4]
    if len(sec) == 1:
        sec = "0" + sec
    print(hour + ":" + mins + ":" + sec)
Output:
HH:MM:SS
It appends a leading 0 if the seconds value is a single digit.
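If a vectorized variant is preferred, here is a sketch with the same assumed hour/mins/sec column names (this one zero-pads every field, not just seconds):
import pandas as pd

df = pd.DataFrame({"hour": [9, 14], "mins": [33, 5], "sec": [9, 41]})

time_str = (
    df["hour"].astype(str).str.zfill(2)
    + ":" + df["mins"].astype(str).str.zfill(2)
    + ":" + df["sec"].astype(str).str.zfill(2)
)
print(time_str.tolist())   # ['09:33:09', '14:05:41']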

Select Pandas dataframe rows based on 'hour' datetime

I have a pandas dataframe 'df' with a column 'DateTimes' of type datetime.time.
The entries of that column are hours of a single day:
00:00:00
.
.
.
23:59:00
Seconds are skipped, it counts by minutes.
How can I choose rows by hour, for example the rows between 00:00:00 and 00:01:00?
If I try this:
df.between_time('00:00:00', '00:00:10')
I get an error that index must be a DateTimeIndex.
I set the index as such with:
df=df.set_index(keys='DateTime')
but I get the same error.
I can't seem to get 'loc' to work either. Any suggestions?
Here is a working example of what you are trying to do:
import numpy as np
import pandas as pd
times = pd.date_range('3/6/2012 00:00', periods=100, freq='S', tz='UTC')
df = pd.DataFrame(np.random.randint(10, size=(100, 1)), index=times)
df.between_time('00:00:00', '00:00:30')
Note the index has to be of type DatetimeIndex.
I understand you have a column with your dates/times. The problem probably is that your column is not of this type, so you have to convert it first, before setting it as index:
# Method A
df = df.set_index(pd.to_datetime(df['column_name']))
# Method B
df.index = pd.to_datetime(df['column_name'])
df = df.drop('column_name', axis=1)
(The drop is only necessary if you want to remove the original column after setting it as index)
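A short sketch of the whole flow, assuming the column holds times of day as strings (pd.to_datetime attaches today's date, which is enough for between_time):
import pandas as pd

df = pd.DataFrame({"DateTime": ["00:00:00", "00:00:30", "00:05:00", "12:30:00"],
                   "value": [1, 2, 3, 4]})

df = df.set_index(pd.to_datetime(df["DateTime"]))   # index is now a DatetimeIndex
print(df.between_time('00:00:00', '00:01:00'))      # rows at 00:00:00 and 00:00:30
If the column actually holds datetime.time objects, convert them to strings first (e.g. df["DateTime"].astype(str)) before passing them to pd.to_datetime.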
Check out these links:
convert column to date type: Convert DataFrame column type from string to datetime
filter dataframe on dates: Filtering Pandas DataFrames on dates
Hope this helps

How to compare date difference in python pandas

I'm appending a column to my pandas dataframe which is the time difference between two dates.
df['time_diff'] = datetime.datetime(2018, 1, 1) - df['IN_TIME']
The type of the new column is <m8[ns]. I'm trying to filter the rows whose 'time_diff' is greater than 30 days, but I can't compare <m8[ns] with a number. How can I do this comparison?
Here's one way. Note you don't need to use the datetime module for these calculations as Pandas has some intuitive functionality for these operations.
df['time_diff'] = pd.to_datetime('2018-01-01') - df['IN_TIME']
df = df[df['time_diff'].dt.days > 30]
This solution assumes df['IN_TIME'] is a datetime series; if it is not, you can convert via df['IN_TIME'] = pd.to_datetime(df['IN_TIME']).
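For instance, a sketch with a made-up IN_TIME column:
import pandas as pd

df = pd.DataFrame({"IN_TIME": pd.to_datetime(["2017-11-15", "2017-12-20"])})

df['time_diff'] = pd.to_datetime('2018-01-01') - df['IN_TIME']
print(df[df['time_diff'].dt.days > 30])   # keeps only the 2017-11-15 row (47 days)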

Pandas - Python, deleting rows based on Date column

I'm trying to delete rows of a dataframe based on one date column; [Delivery Date]
I need to delete rows which are more than 6 months old, except those whose year is '1970'.
I've created 2 variables:
from datetime import date, timedelta
sixmonthago = date.today() - timedelta(188)
import time
nineteen_seventy = time.strptime('01-01-70', '%d-%m-%y')
but I don't know how to delete rows based on these two variables, using the [Delivery Date] column.
Could anyone provide the correct solution?
You can just filter them out:
df[(df['Delivery Date'].dt.year == 1970) | (df['Delivery Date'] >= sixmonthago)]
This returns all rows where the year is 1970 or the date is less than 6 months old.
You can use boolean indexing and pass multiple conditions to filter the df; for multiple conditions you need the array operators (| instead of or) and parentheses around each condition because of operator precedence.
Check the docs for an explanation of boolean indexing
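A compact sketch of that filter with made-up delivery dates (pd.Timestamp just makes the comparison with the date object explicit):
import pandas as pd
from datetime import date, timedelta

sixmonthago = date.today() - timedelta(188)

df = pd.DataFrame({"Delivery Date": pd.to_datetime(["1970-01-01", "2015-03-10", str(date.today())])})

kept = df[(df["Delivery Date"].dt.year == 1970) |
          (df["Delivery Date"] >= pd.Timestamp(sixmonthago))]
print(kept)   # the 1970 row and today's row remain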
Be sure the calculation itself is accurate for "6 months" prior. You may not want to hardcode 188 days; not all months are the same length.
from datetime import date
from dateutil.relativedelta import relativedelta
#http://stackoverflow.com/questions/546321/how-do-i-calculate-the-date-six-months-from-the-current-date-using-the-datetime
six_months = date.today() - relativedelta(months=6)
Then you can apply the following logic.
import time
nineteen_seventy = time.strptime('01-01-70', '%d-%m-%y')
df = df[(df['Delivery Date'].dt.year == nineteen_seventy.tm_year) | (df['Delivery Date'] >= six_months)]
If you truly want to drop those rows from the dataframe, invert the condition and drop by index:
df = df.drop(df[(df['Delivery Date'].dt.year != nineteen_seventy.tm_year) & (df['Delivery Date'] < six_months)].index)
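A small sketch of the drop-by-index form, with made-up dates (same keep/drop logic as the filter above):
import pandas as pd
from datetime import date
from dateutil.relativedelta import relativedelta

six_months = pd.Timestamp(date.today() - relativedelta(months=6))

df = pd.DataFrame({"Delivery Date": pd.to_datetime(["1970-01-01", "2015-03-10", str(date.today())])})

# drop rows that are neither from 1970 nor within the last six months
mask = (df["Delivery Date"].dt.year != 1970) & (df["Delivery Date"] < six_months)
df = df.drop(df[mask].index)
print(df)   # the 1970 row and today's row remain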
