Taking difference between column containing dates and today (using vetorization) - python

I am trying to avoid loops as this is just a subset of the bigger dataframe I have which has over 30k rows. All I want to do is create a new column with difference between the date in that row and today's date.
What's the best way to do it?
import pandas as pd
df = pd.DataFrame({'Date' : ['2014-03-27', '2014-03-28', '2014-03-31', '2014-04-01', '2014-04-02', '2014-04-03', '2014-04-04', '2014-04-07','2014-04-08', '2014-04-09'],
})

df['diff'] = (datetime.datetime.now()-pd.to_datetime(df['Date'])).dt.days

df['Date'] = pd.to_date_time(df['Date'])
df['num_days_diff'] = (np.datetime64('today', 'D') - df['Date'])/np.timedelta64(1, 'D')

Related

Pandas dataframe drop multiple rows based on datetime difference

I store datetimes in a pandas dataframe which look like dd/mm/yyyy hh:mm:ss
I want to drop all rows where values in column x (datetime) are within 24 hours of one another.
On a 1 by 1 basis, I was previously doing this, which doesn't seem to work within the drop function:
df.drop(df[(df['d2'] - df['d1']).seconds / 3600 < 24].index)
>> AttributeError: 'Series' object has no attribute 'seconds'
This should work
df.loc[ (df.d2 - df.d1) >= datetime.timedelta(days=1) ]
the answer is very easy
import pandas as pd
df = pd.read_csv("test.csv")
df["d1"] = pd.to_datetime(df["d1"])
df["d2"] = pd.to_datetime(df["d2"])
now if you tried to subtract columns from each other
df["first"] - df["second"]
output will be in days and hence and as what #kaan suggested
df.loc[(df["d2"] - df["d1"]) >= pd.Timedelta(days=1)]

How to create a "duration" column from two "dates" columns?

I have two columns ("basecamp_date" and "highpoint_date") in my "expeditions" dataframe, they have a start date (basecamp_date) and an end date ("highpoint_date") and I would like to create a new column that expresses the duration between these two dates but I have no idea how to do it.
import pandas as pd
expeditions = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/expeditions.csv")
In read_csv convert columns to datetimes and then subtrat columns with Series.dt.days for days:
file = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/expeditions.csv"
expeditions = pd.read_csv(file, parse_dates=['basecamp_date','highpoint_date'])
expeditions['diff'] = expeditions['highpoint_date'].sub(expeditions['basecamp_date']).dt.days
You can convert those columns to datetime and then subtract them to get the duration:
tstart = pd.to_datetime(expeditions['basecamp_date'])
tend = pd.to_datetime(expeditions['highpoint_date'])
expeditions['duration'])= pd.Timedelta(tend - tstart)

Converting unix time into datetime within csv

i have a csv file with many lines and three column. first column is the unix time, second column the price, and third column represents the volume of the symbol that has been traded at that specific price. what i'm doing is, calculating ohlc for different time frames (e.g. 1h, 4h, 12h, 1d) out of tha csv file. that is working very well by first converting the unix time into datetime
code:
import pandas as pd
df = pd.read_csv('file.csv', names=['date', 'price', 'volume'])
df['date'] = pd.to_datetime(df['date'], unit='s')
df = df.set_index('date')
df = df['price'].resample('4h').ohlc()
df.to_csv('file_4h_ohlc.csv')
result:
date,open,high,low,close
2017-05-01 20:00:00,0.757881,1.07,0.650011,1.069999
target:
i wanna now converte the datetime (2017-05-01 20:00:00) back to the unix time (1493658000) within the same file by keeping the ohlc values. or if not possible so, to save into a different file.
thanks a lot for support and sorry if such question has been already answered, but i didnt find it
-hotshot
You can create a new date column instead of overwriting the existing one, so you can re-use it as the index.
import pandas as pd
df = pd.read_csv('file.csv', names=['date', 'price', 'volume'])
df['datestamp'] = pd.to_datetime(df['date'], unit='s')
df = df.set_index('datestamp')
df = df['price'].resample('4h').ohlc()
# Set the index back to the original (after calculating ohlc)
df = df.set_index('date')
# Optional: Drop the datestamp column
df = df.drop(columns=['datestamp'])
df.to_csv('file_4h_ohlc.csv')
Alternatively, you can convert the existing datetime column to a Unix timestamp like so:
df['date'].apply(lambda x : (x - datetime.datetime(1970, 1, 1)).total_seconds())

It is possible to create a new data frame on Pandas from a time series, with the daily diference?

Is there a way to create a new data frame from a time series with the daily diffence?
This means, suppose that on October 5 I had 5321 counts and on October 6 5331 counts. This represents the difference of 10; what I want is, for example, that my DataFrame shows 10 on October 6.
Here's my code of the raw dataframe:
import pandas as pd
from datetime import datetime, timedelta
url = 'https://raw.githubusercontent.com/mariorz/covid19-mx-time-series/master/data/covid19_confirmed_mx.csv'
df = pd.read_csv(url, index_col=0)
df = df.loc['Colima','18-03-2020':'06-10-2020']
df = pd.DataFrame(df)
df.index = pd.to_datetime(df.index, format='%d-%m-%Y')
df
This is the raw outcome:
Thank you guys!
There's an inbuilt diff function just for these kind of operations:
df['Diff'] = df.Colima.diff()
Yes, you can use the shift method to access the preceding row's value to calculate the difference.
df['difference'] = df.Colima - df.Colima.shift(1)

Python: Time Series with Pandas

I want to use time series with Pandas. I read multiple time series one by one, from a csv file which has the date in the column named "Date" as (YYYY-MM-DD):
Date,Business,Education,Holiday
2005-01-01,6665,8511,86397
2005-02-01,8910,12043,92453
2005-03-01,8834,12720,78846
2005-04-01,8127,11667,52644
2005-05-01,7762,11092,33789
2005-06-01,7652,10898,34245
2005-07-01,7403,12787,42020
2005-08-01,7968,13235,36190
2005-09-01,8345,12141,36038
2005-10-01,8553,12067,41089
2005-11-01,8880,11603,59415
2005-12-01,8331,9175,70736
df = pd.read_csv(csv_file, index_col = 'Date',header=0)
Series_list = df.keys()
The time series can have different frequencies: day, week, month, quarter, year and I want to index the time series according to a frequency I decide before I generate the Arima model. Could someone please explain how can I define the frequency of the series?
stepwise_fit = auto_arima(df[Series_name]....
pandas has a built in function pandas.infer_freq()
import pandas as pd
df = pd.DataFrame({'Date': ['2005-01-01', '2005-02-01', '2005-03-01', '2005-04-01'],
'Date1': ['2005-01-01', '2005-01-02', '2005-01-03', '2005-01-04'],
'Date2': ['2006-01-01', '2007-01-01', '2008-01-01', '2009-01-01'],
'Date3': ['2006-01-01', '2006-02-06', '2006-03-11', '2006-04-01']})
df['Date'] = pd.to_datetime(df['Date'])
df['Date1'] = pd.to_datetime(df['Date1'])
df['Date2'] = pd.to_datetime(df['Date2'])
df['Date3'] = pd.to_datetime(df['Date3'])
pd.infer_freq(df.Date)
#'MS'
pd.infer_freq(df.Date1)
#'D'
pd.infer_freq(df.Date2)
#'AS-JAN'
Alternatively you could also make use of the datetime functionality of the columns.
df.Date.dt.freq
#'MS'
Of course if your data doesn't actually have a real frequency, then you won't get anything.
pd.infer_freq(df.Date3)
#
The frequency descriptions are docmented under offset-aliases.

Categories