Lagging parameters in pandas? - python

I'm new to pandas. I have a dataframe of TTI values, sorted by hour of the day, covering many years. I want to add a new column showing the previous year's TTI value for each row. I wrote this code:
import pandas as pd
tti = pd.read_csv("c:\\users\\Mehrdad\\desktop\\Hourly_TTI.csv")
tti['new_date'] = pd.to_datetime(tti['Date'])
tti['last_year'] = tti['TTI'].shift(1,freq='1-Jan-2009')
print tti.head(10)
but I don't know how to define the frequency value for shift so that it shifts my data back one year from my first date, which is 01-01-2010.

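# date exactly one year before each row's date
df['last_year'] = df['date'] - pd.DateOffset(years=1)
# look up the TTI recorded on that earlier date; NaN where no such row exists
df['new_value'] = df['last_year'].map(df.set_index('date')['TTI'])
df.shift can only move values by a fixed distance, so it cannot look up "the same timestamp one year earlier" on an ordinary column.
Use a DateOffset to compute the earlier date for each row and retrieve the value recorded at that date. Be aware that the first year has no earlier data, so those rows need to be dropped or ignored.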
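A minimal sketch of this lookup, assuming daily rather than hourly data and made-up TTI values (the column names date and TTI and the two-year range are illustrative assumptions, not the asker's real file):

```python
import pandas as pd

# Hypothetical data: one row per day for 2010 and 2011, TTI = running counter.
dates = pd.date_range('2010-01-01', '2011-12-31', freq='D')
tti = pd.DataFrame({'date': dates, 'TTI': range(len(dates))})

# Series of TTI values keyed by date, used as a lookup table.
by_date = tti.set_index('date')['TTI']

# For every row, compute the date one calendar year earlier ...
tti['last_year_date'] = tti['date'] - pd.DateOffset(years=1)
# ... and fetch the TTI recorded on that date (NaN if it does not exist).
tti['last_year'] = tti['last_year_date'].map(by_date)

# Rows from the first year have no predecessor and can be dropped.
with_history = tti.dropna(subset=['last_year'])
```

The lookup assumes each date occurs only once; with duplicated timestamps the mapping step would need an aggregation first.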

Related

Time Series Lag Features Extraction

I am trying to use the shift function for feature extraction to create 3 additional columns: same day last week, same day last month, same day last year. The data I am using is found here.
Initially, I am just trying to use the shift function before creating a new column.
data['timestamp'] = pd.to_datetime(data['timestamp'])
data['year'] = data['timestamp'].dt.year
data['month'] = data['timestamp'].dt.month
data['day'] = data['timestamp'].dt.day
data['day'] = pd.to_datetime(data['day'])
data.info()
the_7_days_diff = data['day'] - data.shift(freq='7D')['day']
I am getting the error "This method is only implemented for DatetimeIndex, PeriodIndex and TimedeltaIndex; Got type RangeIndex".
Any help understanding what I am doing wrong would be appreciated.
The error means that shift with freq operates on the index of the dataframe, not on the values, so the index must be a DatetimeIndex. Set the timestamp column as the index after converting it to a datetime dtype.
data['timestamp'] = pd.to_datetime(data['timestamp'])
data = data.set_index('timestamp')
week_diff = (data - data.shift(freq='7D')).dropna()
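A rough sketch of the three lag columns the question asks for, assuming a daily series with a single made-up value column. The weekly lag uses shift(freq=...); for the month and year lags this sketch swaps in a date lookup instead, because shifting by a calendar month or year can map several days onto the same date (for example Jan 29, 30 and 31 all land on the last day of February):

```python
import pandas as pd

# Hypothetical daily data; 'value' counts days since 2020-01-01.
idx = pd.date_range('2020-01-01', periods=400, freq='D')
data = pd.DataFrame({'value': range(400)}, index=idx)

# Same day last week: shift the index forward by 7 days, then align on assignment.
data['last_week'] = data['value'].shift(freq='7D')

# Same day last month / last year: look up the value at the earlier date.
month_ago = data.index - pd.DateOffset(months=1)
year_ago = data.index - pd.DateOffset(years=1)
data['last_month'] = data['value'].reindex(month_ago).to_numpy()
data['last_year'] = data['value'].reindex(year_ago).to_numpy()
```

Rows whose earlier date is not in the index come out as NaN, which is usually what a lag feature should contain at the start of the series.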

Creating a matplotlib line graph using datetime objects while ignoring the year value

I have a dataset of highest and lowest temperatures recorded for each day of the year, for the years 2005-2014. I want to create a graph where I plot the max and min temperatures for each day of the year for this period (so there will be only one max and min temperature for each day plotted). I was able to create a df from the data set of the absolute min and maxs for each day, here's the example of the max:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data/C2A2_data/BinnedCsvs_d400/fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv')
# splitting 2005-2014 df dates into separate columns for easier analysis
weather_05_14['Year'] = weather_05_14['Date'].dt.strftime('%Y')
weather_05_14['Month'] = weather_05_14['Date'].dt.strftime('%m')
weather_05_14['Day'] = weather_05_14['Date'].dt.strftime('%d')
# extracting the min and max temperatures for each day, regardless of year
max_temps = weather_05_14.loc[weather_05_14.groupby(['Day', 'Month'], sort=False)
['Data_Value'].idxmax()][['Data_Value', 'Date']]
max_temps.rename(columns={'Data_Value': 'Max'}, inplace=True)
This is what the data frame looks like:
Now here's where my issue is. I want to plot this data in a line plot based on month/day, disregarding the year so it's in order. My thought was that I could do this by changing the year to be the same for every data point (as it won't be data that will be in the final graph anyway) and this is what I did to try to accomplish that:
max_temps['Date'] = max_temps['Date'].apply(lambda x: x.replace(year=2005))
but I got this error:
ValueError: day is out of range for month
I have also tried to take my separate Day, Month, Year columns that I used to group by, include those with the max_temps df, change the year, and then move those all to a new column and convert them to a datetime object, but I get a similar error
max_temps['Year'] = 2005
max_temps['New Date'] = pd.to_datetime(max_temps[['Year', 'Month', 'Day']])
Error: ValueError: cannot assemble the datetimes: day is out of range for month
I have also tried to ignore this issue and then plot with the pandas plot function like:
max_temps.plot(x=['Month', 'Day'], y=['Max'])
Which does work but then I don't get the full functionality of matplotlib (as far as I can tell anyway, I'm new to these libraries).
It gives me this graph:
This is close to the result I'm looking for, but I'd like to use matplotlib to do it.
I feel like I'm making the problem harder than it needs to be but I don't know how. If anyone has any advice or suggestions I would greatly appreciate it, thanks!
As @Jody Klymak pointed out, max_temps['Date'] = max_temps['Date'].apply(lambda x: x.replace(year=2005)) fails because your full dataset probably includes a leap year, and therefore a February 29. When you set the year to 2005, pandas tries to create the date 2005-02-29, which throws ValueError: day is out of range for month. You can fix this by choosing a leap year such as 2004 instead of 2005.
My solution would be to disregard the year entirely and create a new column that holds the month and day in the format "01-01". Since the month comes first, sorting these strings puts them in chronological order regardless of the year.
Here's an example:
import pandas as pd
import matplotlib.pyplot as plt
max_temps = pd.DataFrame({
'Max': [15.6,13.9,13.3,10.6,12.8,18.9,21.7],
'Date': ['2005-01-01','2005-01-02','2005-01-03','2007-01-04','2007-01-05','2008-01-06','2008-01-07']
})
max_temps['Date'] = pd.to_datetime(max_temps['Date'])
## use string formatting to create a new column with Month-Day
max_temps['Month_Day'] = max_temps['Date'].dt.strftime('%m-%d')
plt.plot(max_temps['Month_Day'], max_temps['Max'])
plt.show()
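If a real date axis is preferred over strings, the leap-year idea mentioned above can be sketched like this. The sample values are made up, and the Plot_Date column name is an assumption for illustration:

```python
import pandas as pd

# Hypothetical record highs taken from different years, deliberately unsorted.
max_temps = pd.DataFrame({
    'Max': [21.7, 15.6, 12.8, 18.9],
    'Date': pd.to_datetime(['2008-02-29', '2005-01-01', '2007-12-31', '2006-02-28']),
})

# Move every date into the leap year 2004 so that Feb 29 stays valid.
max_temps['Plot_Date'] = max_temps['Date'].apply(lambda d: d.replace(year=2004))

# Sort by the normalized date so a line plot connects the points in calendar order.
ordered = max_temps.sort_values('Plot_Date').reset_index(drop=True)
```

ordered['Plot_Date'] and ordered['Max'] can then be passed to plt.plot, and a formatter such as matplotlib.dates.DateFormatter('%b') can be set on the x-axis so the placeholder year never appears in the labels.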

Pandas DateTime index, assign value to column based on based on time AND day filter

I'm trying to assign a value to a column based on a day and time filter.
Let's say I create a dataframe:
import pandas as pd
from datetime import date
date_range = pd.DataFrame({'date': pd.date_range(date(2019,8,30), date.today(), freq='15T')})
date_range.index = date_range['date']
I can then filter for the day (Sunday) and assign a value:
date_range['Happy Hour'] = ((pd.DatetimeIndex(date_range.index).dayofweek) // 6 == 'Yes!!')
and I can also filter a timeframe and assign a value
date_range.loc[date_range.between_time('15:00','18:59').index, 'Happy Hour'] = 'Yes!!'
But surely there is an easy way to combine these into one line of code, so that every Sunday between 15:00 and 19:00 the Happy Hour column gets filled with 'Yes!!'?
First filter for Sundays, then filter by time:
idx = date_range[date_range.index.dayofweek == 6].between_time('15:00','18:59').index
date_range.loc[idx, 'Happy Hour'] = 'Yes!!'
print (date_range)
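An alternative sketch that combines both conditions into one boolean mask instead of chaining between_time. The eight-day range and the empty-string default are assumptions chosen to keep the example small:

```python
import pandas as pd

# Hypothetical 15-minute grid from Sunday 2019-09-01 through Sunday 2019-09-08.
idx = pd.date_range('2019-09-01', periods=4 * 24 * 8, freq='15min')
schedule = pd.DataFrame(index=idx)

# dayofweek == 6 marks Sundays; hours 15, 16, 17 and 18 cover 15:00 to 18:59.
is_sunday = schedule.index.dayofweek == 6
in_window = (schedule.index.hour >= 15) & (schedule.index.hour < 19)

schedule['Happy Hour'] = ''
schedule.loc[is_sunday & in_window, 'Happy Hour'] = 'Yes!!'

happy_slots = (schedule['Happy Hour'] == 'Yes!!').sum()
```

Because the mask is built from the index itself, no intermediate filtered frame is needed, and further conditions can be added with & or |.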

How can I set a date interval for my data in python?

I'm desperately trying to group my data in order to see in which months most people travel, but first I want to remove all the data from before a certain year.
As you can see in the picture, I have data going all the way back to the year 0003, which I do not want to include.
How can I set an interval from 1938-01-01 to 2020-09-21 with pandas and datetime?
One way to solve this is:
Verify that the date column is in datetime format (it is necessary to convert it):
df.date_start = pd.to_datetime(df.date_start)
Set date_start as the new index:
df.index = df.date_start
Then group by month and country code:
df.groupby([pd.Grouper(freq="1M"), "country_code"]) \
    .agg({"Name of the column with frequencies": "sum"})
Boolean indexing with pandas.Series.between
# sample data
df = pd.DataFrame(pd.date_range('1910-01-01', '2020-09-21', periods=10), columns=['Date'])
# boolean indexing with Series.between
new_df = df[df['Date'].between('1938-01-01', '2020-09-21')]
# groupby month and get the count of each group
months = new_df.groupby(new_df['Date'].dt.month).count()
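A small sketch of how the two answers might be combined when the raw column contains implausible years such as 0003. The sample strings are made up, and the column name date_start is borrowed from the first answer. In pandas versions that store timestamps with nanosecond resolution, dates before 1677 cannot be represented, so errors='coerce' is passed here to turn them into NaT instead of raising; either way, between excludes them:

```python
import pandas as pd

# Hypothetical raw data, including one clearly wrong year.
raw = pd.DataFrame({'date_start': [
    '0003-05-01', '1937-12-31', '1938-01-01',
    '1990-07-15', '2020-07-04', '2020-09-21',
]})

# Unparseable or unrepresentable dates become NaT rather than raising an error.
raw['date_start'] = pd.to_datetime(raw['date_start'], errors='coerce')

# Keep only rows inside the wanted interval (both ends inclusive; NaT is excluded).
in_range = raw['date_start'].between(pd.Timestamp('1938-01-01'),
                                     pd.Timestamp('2020-09-21'))
kept = raw[in_range]

# Number of trips per calendar month, regardless of year.
trips_per_month = kept.groupby(kept['date_start'].dt.month).size()
```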

loc function: next value

I have a small problem with the .loc function.
Here is the code:
date = df.loc[df['date'] == d].index[0]
d is a specific date (e.g. 21.11.2019).
The problem is that d can fall on a weekend. The date column of the dataframe has no values for weekend days (it contains working days only).
Is there any way to take the next available day if d falls on a weekend?
I am looking for something like index.get_loc with method='bfill'.
Does anyone know how to implement that for .loc?
IIUC you want to move dates of the format dd.mm.yyyy to the following Monday if they fall on a weekend, and leave them as they are if they are workdays. A simple approach is to modify d before you pass it to df.loc[...] instead of searching for the nearest neighbour.
What I mean is:
import datetime
d="22.12.2019"
dt=datetime.datetime.strptime(d, "%d.%m.%Y")
if dt.weekday() in [5, 6]:
    dt = dt + datetime.timedelta(days=7 - dt.weekday())
d = dt.strftime("%d.%m.%Y")
Output:
23.12.2019
Edit
To take the first date on or after d that has an entry in your dataframe, try:
import datetime
import pandas as pd
df['date'] = pd.to_datetime(df['date'], format='%d.%m.%Y')
dt = datetime.datetime.strptime(d, "%d.%m.%Y")
# compare against the parsed datetime, not the original string
d = df.loc[df['date'] >= dt, 'date'].min()
df.loc[df['date'] == d]...
...
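The backfill behaviour the question mentions can also be sketched directly with Index.get_indexer and method='bfill', which returns the position of the first entry on or after the target. The three trading days and the close column below are made-up sample data, and the approach assumes the dates are unique and sorted:

```python
import pandas as pd

# Hypothetical working-day data with a weekend gap (21. and 22.12.2019 missing).
df = pd.DataFrame({
    'date': pd.to_datetime(['20.12.2019', '23.12.2019', '24.12.2019'],
                           format='%d.%m.%Y'),
    'close': [1.0, 2.0, 3.0],
})

# Target date falls on a Saturday.
target = pd.to_datetime('21.12.2019', format='%d.%m.%Y')

# Position of the first date >= target; -1 would mean no later date exists.
dates = pd.DatetimeIndex(df['date'])
pos = dates.get_indexer([target], method='bfill')[0]

next_row = df.iloc[pos]
```

Unlike the weekday arithmetic, this also skips public holidays, since it only relies on which dates are actually present in the dataframe.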
