Trying to find the difference in days between 2 dates - python

I have a date column in my dataframe and I am trying to create a new column ('delta_days') that has the difference (in days) between the current row and the previous row.
# Find amount of days difference between dates
for i in df:
    new_date = date(df.iloc[i,'date'])
    old_date = date(df.iloc[i-1,'date']) if i > 0 else date(df.iloc[0, 'date'])
    df.iloc[i,'delta_days'] = new_date - old_date
I am using iloc because I want to directly reference the 'date' column, while i represents the current row.
I am getting this error:
ValueError: Location based indexing can only have [integer, integer
slice (START point is INCLUDED, END point is EXCLUDED), listlike of
integers, boolean array] types
Can someone please help?

You can use the pandas.DataFrame.shift method to line each row up with the previous row's date.
Something more or less like this:
df['prev_date'] = df['date'].shift(1)
df['delta_days'] = df['date'] - df['prev_date']
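A minimal sketch of the shift approach, assuming the 'date' column can be parsed by pd.to_datetime (the sample dates are made up for illustration). The subtraction yields Timedelta values, and .dt.days turns them into day counts. Series.diff() is shown as an equivalent shortcut.

```python
import pandas as pd

# Hypothetical sample data; replace with your own 'date' column.
df = pd.DataFrame({"date": ["2021-01-01", "2021-01-04", "2021-01-10"]})
df["date"] = pd.to_datetime(df["date"])

# Difference between each row and the previous row, as a Timedelta.
delta = df["date"] - df["date"].shift(1)

# Convert to whole days; the first row has no previous row, so fill it with 0.
df["delta_days"] = delta.dt.days.fillna(0).astype(int)

# Series.diff() computes the same row-to-row difference in one call.
df["delta_days_alt"] = df["date"].diff().dt.days.fillna(0).astype(int)

print(df)
```

Filling the first row with 0 is a choice made here for illustration; leaving it as NaN is equally valid if a missing value is more meaningful for your data.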

Related

Time Series Lag Features Extraction

Trying to use the shift function for Feature Extraction to create 3 additional columns: same day last week, same day last month, same day last year. Data I am using is found here
Initially, I am trying to just use the shift function before creating a new column.
data['timestamp'] = pd.to_datetime(data['timestamp'])
data['year'] = data['timestamp'].dt.year
data['month'] = data['timestamp'].dt.month
data['day'] = data['timestamp'].dt.day
data['day'] = pd.to_datetime(data['day'])
data.info()
the_7_days_diff = data['day'] - data.shift(freq='7D')['day']
Getting an error "This method is only implemented for DatetimeIndex, PeriodIndex and TimedeltaIndex; Got type RangeIndex"
Any help would be appreciated to understand what I am doing wrong.
The error means that, when freq is passed, shift moves the index of the dataframe rather than its values, so the index must be a DatetimeIndex. Set the timestamp column as the index after converting it to a datetime data type.
data['timestamp'] = pd.to_datetime(data['timestamp'])
data = data.set_index('timestamp')
week_diff = (data - data.shift(freq='7D')).dropna()
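To see what shift(freq='7D') does to the index, here is a small sketch on made-up daily data (the 'value' column and the dates are assumptions for illustration, not the asker's dataset).

```python
import pandas as pd

# Hypothetical daily series; 'value' is an illustrative column name.
data = pd.DataFrame({
    "timestamp": pd.date_range("2021-01-01", periods=10, freq="D"),
    "value": range(10),
})
data = data.set_index("timestamp")

# shift(freq="7D") moves the index forward by 7 days and leaves values untouched,
# so the shifted row labeled 2021-01-08 carries the value from 2021-01-01.
shifted = data.shift(freq="7D")

# Subtraction aligns on the index; dates without a match a week earlier become NaN.
week_diff = (data - shifted).dropna()

print(week_diff)
```

Only the dates that exist in both the original and the shifted frame survive dropna(), which is why the first seven days disappear from the result.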

Make conditional changes to numerous dates

I'm sure this is really easy to answer but I have only just started using Pandas.
I have a column in my excel file called 'Day' and a Date/time column called 'Date'.
I want to update my Day column with the corresponding day of NUMEROUS Dates from the 'Date' column.
So far I use this code shown below to change the date/time to just date
df['Date'] = pd.to_datetime(df.Date).dt.strftime('%d/%m/%Y')
And then use this code to change the 'Day' column to Tuesday
df.loc[df['Date'] == '02/02/2018', 'Day'] = '2'
(2 signifies the 2nd day of the week)
This works great. The problem is, my excel sheet has 500000+ rows of data and lots of dates. Therefore I need this code to work with numerous dates (4 different dates to be exact)
For example; I have tried this code;
df.loc[df['Date'] == '02/02/2018' + '09/02/2018' + '16/02/2018' + '23/02/2018', 'Day'] = '2'
Which does not give me an error, but does not change the date to 2. I know I could just use the same line of code numerous times and change the date each time...but there must be a way to do it the way I explained? Help would be greatly appreciated :)
2/2/2018 is a Friday, so I don't know what "2nd day of the week" means. Does your week start on Thursday?
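If Date is parsed as datetime (the strftime call in your code turns it back into strings), use the dt accessor. Note that dayofweek is a property, not a method:
df['Day'] = pd.to_datetime(df['Date'], dayfirst=True).dt.dayofweek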
Monday is 0 and Sunday = 6. Manipulate that as needed.
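A short sketch of the day-of-week lookup, assuming Date holds day/month/year strings as in the question. The DayName and Day1 columns are hypothetical additions that show two ways to adjust the numbering.

```python
import pandas as pd

# Sample strings in the question's dd/mm/yyyy format.
df = pd.DataFrame({"Date": ["02/02/2018", "05/02/2018", "11/02/2018"]})

# Parse with an explicit format so day and month are not swapped.
dates = pd.to_datetime(df["Date"], format="%d/%m/%Y")

df["Day"] = dates.dt.dayofweek        # Monday=0 ... Sunday=6
df["Day1"] = dates.dt.dayofweek + 1   # 1-based variant: Monday=1 ... Sunday=7
df["DayName"] = dates.dt.day_name()   # e.g. "Friday"

print(df)
```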
If I got it right, you want to change the Day column for just a few dates, right? If so, you can put these dates in a separate list and do
my_dates = ['02/02/2018', '09/02/2018', '16/02/2018', '23/02/2018']
df.loc[df['Date'].isin(my_dates), 'Day'] = '2'
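The attempt with + did not raise an error because adding strings simply concatenates them into one long string, which then matches no row. A minimal sketch of the isin version on made-up rows (the sample frame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical rows; 'Date' holds dd/mm/yyyy strings as in the question.
df = pd.DataFrame({
    "Date": ["02/02/2018", "03/02/2018", "09/02/2018", "10/02/2018"],
    "Day": ["", "", "", ""],
})

my_dates = ["02/02/2018", "09/02/2018", "16/02/2018", "23/02/2018"]

# isin builds a boolean mask that is True wherever Date matches any listed value.
mask = df["Date"].isin(my_dates)
df.loc[mask, "Day"] = "2"

# For comparison: '+' joins the strings, so this single value matches nothing.
joined = "02/02/2018" + "09/02/2018"
no_match = (df["Date"] == joined).any()

print(df)
```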

lagging parameters in panda?

I'm new to pandas. I have a dataframe of TTI values, sorted by hour of the day, covering many years. I want to add a new column showing last year's TTI value for each row. I wrote this code:
import pandas as pd
tti = pd.read_csv("c:\\users\\Mehrdad\\desktop\\Hourly_TTI.csv")
tti['new_date'] = pd.to_datetime(tti['Date'])
tti['last_year'] = tti['TTI'].shift(1,freq='1-Jan-2009')
print tti.head(10)
But I don't know how to define the frequency value for shift so that it moves my data back by one year, starting from my first date, which is 01-01-2010.
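df['last_year'] = df['date'] - pd.DateOffset(years=1)
# Look up each last-year timestamp in a date-indexed Series (assumes unique dates and a 'TTI' column).
df['new_value'] = df.set_index('date')['TTI'].reindex(df['last_year']).to_numpy()
Without a freq argument, df.shift can only move values by a fixed number of rows.
Use an offset to build the last-year timestamps and retrieve the values by date. Keep in mind that the first year has no earlier data, so those rows come back as NaN and may need to be dropped.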
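A small sketch of the offset-and-lookup idea, assuming a datetime column named 'date' with unique values and a numeric 'TTI' column (the sample values and the 'last_year_TTI' name are made up for illustration).

```python
import pandas as pd

# Hypothetical data: two timestamps in each of two consecutive years.
df = pd.DataFrame({
    "date": pd.to_datetime(["2010-01-01", "2010-06-01", "2011-01-01", "2011-06-01"]),
    "TTI": [1.0, 1.2, 1.5, 1.1],
})

# Timestamp exactly one calendar year before each row.
df["last_year"] = df["date"] - pd.DateOffset(years=1)

# Look the shifted timestamps up in a date-indexed Series (requires unique dates).
by_date = df.set_index("date")["TTI"]
df["last_year_TTI"] = by_date.reindex(df["last_year"]).to_numpy()

print(df)
```

Rows from the first year have nothing to look up, so they hold NaN; calling dropna(subset=["last_year_TTI"]) removes them if that suits your analysis.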

Mapping Values in a pandas Dataframe column?

I am trying to filter out some data and seem to be running into some errors.
Below this statement is a replica of the following code I have:
url = "http://elections.huffingtonpost.com/pollster/2012-general-election-romney-vs-obama.csv"
source = requests.get(url).text
s = StringIO(source)
election_data = pd.DataFrame.from_csv(s, index_col=None).convert_objects(
convert_dates="coerce", convert_numeric=True)
election_data.head(n=3)
last_day = max(election_data["Start Date"])
filtered = election_data[((last_day-election_data['Start Date']).days <= 5)]
As you can see last_day is the max within the column election_data
I would like to filter out the data in which the difference between
the max and x is less than or equal to 5 days
I have tried using for - loops, and various combinations of list comprehension.
filtered = election_data[map(lambda x: (last_day - x).days <= 5, election_data["Start Date"]) ]
This line would normally work however, python3 gives me the following error:
<map object at 0x10798a2b0>
Your first attempt is almost right. The issue is
(last_day - election_data['Start Date']).days
which should instead be
(last_day - election_data['Start Date']).dt.days
A Series has no days attribute of its own; only Timedelta and TimedeltaIndex objects do. For a Series of timedeltas, go through the .dt accessor. A fully working example is below.
data = pd.read_csv(url, parse_dates=['Start Date', 'End Date', 'Entry Date/Time (ET)'])
data.loc[(data['Start Date'].max() - data['Start Date']).dt.days <= 5]
Note that I've used Series.max which is more performant than the built-in max. Also, data.loc[mask] is slightly faster than data[mask] since it is less-overloaded (has a more specialized use case).
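For a version that runs without downloading the CSV, here is a sketch on made-up rows (the 'Pollster' values and dates are assumptions for illustration). It also shows why the map attempt failed in Python 3: map returns a lazy iterator, so it has to be wrapped in list() before it can serve as a boolean mask.

```python
import pandas as pd

# Hypothetical stand-in for the polling data.
data = pd.DataFrame({
    "Start Date": pd.to_datetime(["2012-10-25", "2012-11-01", "2012-11-04", "2012-11-05"]),
    "Pollster": ["A", "B", "C", "D"],
})

last_day = data["Start Date"].max()
gap = last_day - data["Start Date"]  # Series of Timedelta values

# Option 1: compare whole days through the .dt accessor.
recent = data.loc[gap.dt.days <= 5]

# Option 2: compare directly against a Timedelta.
recent_alt = data.loc[gap <= pd.Timedelta(days=5)]

# Option 3: the map approach, made to work by materializing it as a list.
mask = list(map(lambda x: (last_day - x).days <= 5, data["Start Date"]))
recent_map = data[mask]

print(recent)
```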
If I understand your question correctly, you want to keep only the rows whose Start Date is at most 5 days before the last day. pandas indexing handles this easily with .loc.
If you want an entirely new DataFrame object with the filtered data:
last_day = election_data["Start Date"].max()
max_gap = pd.Timedelta(days=5)  # largest allowed distance from the last day
new_df = election_data.loc[last_day - election_data["Start Date"] <= max_gap]
Or if you just want the Start Date column post-filtering:
last_day = election_data["Start Date"].max()
max_gap = pd.Timedelta(days=5)
filtered_dates = election_data.loc[last_day - election_data["Start Date"] <= max_gap, "Start Date"]
Note that the threshold must be a Timedelta rather than a date, because subtracting two datetimes yields a timedelta. This also assumes Start Date has already been parsed as datetime (for example with parse_dates in pd.read_csv).

Pandas: select all dates with specific month and day

I have a dataframe full of dates and I would like to select all dates where the month==12 and the day==25 and replace the zero in the xmas column with a 1.
Any way to do this? The second line of my code errors out.
df = DataFrame({'date':[datetime(2013,1,1).date() + timedelta(days=i) for i in range(0,365*2)], 'xmas':np.zeros(365*2)})
df[df['date'].month==12 and df['date'].day==25] = 1
A pandas Series of datetimes now behaves differently: date components are reached through the .dt accessor.
This is how it should be done now:
df.loc[(df['date'].dt.day==25) & (df['date'].dt.month==12), 'xmas'] = 1
Basically, what you tried won't work because you need & (not and) to combine boolean arrays element-wise. You also need parentheses because of operator precedence. On top of this, you should use loc to perform the indexing:
df.loc[(df['date'].month==12) & (df['date'].day==25), 'xmas'] = 1
An update was needed in reply to this question. As of today, there's a slight difference in how you extract months from datetime objects in a pd.Series.
So from the very start, in case you have a raw date column, first convert it to datetime objects by using a simple function:
import datetime as dt
def read_as_datetime(str_date):
    # replace %Y-%m-%d with your own date format
    return dt.datetime.strptime(str_date,'%Y-%m-%d')
Then apply this function to your dates column and save the results in a new column named datetime:
df['datetime'] = df.dates.apply(read_as_datetime)
Finally, to select dates by day and month, use the same piece of code that @Shayan RC explained, with one slight change: notice the .dt accessor after the datetime column:
df.loc[(df['datetime'].dt.month==12) & (df['datetime'].dt.day==25), 'xmas'] = 1
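Putting the pieces together, here is a runnable sketch built on the frame from the question. One assumption worth flagging: the question's column holds plain datetime.date objects, so it is converted with pd.to_datetime before the .dt accessor is used.

```python
from datetime import datetime, timedelta

import numpy as np
import pandas as pd

# Same construction as in the question: two years of consecutive dates.
df = pd.DataFrame({
    "date": [datetime(2013, 1, 1).date() + timedelta(days=i) for i in range(365 * 2)],
    "xmas": np.zeros(365 * 2),
})

# The column starts as object dtype (datetime.date), so convert it first.
df["date"] = pd.to_datetime(df["date"])

# Combine the two element-wise conditions with & and wrap each in parentheses.
is_xmas = (df["date"].dt.month == 12) & (df["date"].dt.day == 25)
df.loc[is_xmas, "xmas"] = 1

print(df[df["xmas"] == 1])
```

pd.to_datetime is used here in place of the row-by-row strptime helper because it converts the whole column in one call; either route ends with a datetime column that supports .dt.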
