I am trying to use the shift function for feature extraction to create 3 additional columns: same day last week, same day last month, and same day last year. The data I am using is found here.
Initially, I am trying to just use the shift function before creating a new column.
data['timestamp'] = pd.to_datetime(data['timestamp'])
data['year'] = data['timestamp'].dt.year
data['month'] = data['timestamp'].dt.month
data['day'] = data['timestamp'].dt.day
data['day'] = pd.to_datetime(data['day'])
data.info()
the_7_days_diff = data['day'] - data.shift(freq='7D')['day']
Getting an error "This method is only implemented for DatetimeIndex, PeriodIndex and TimedeltaIndex; Got type RangeIndex"
Any help would be appreciated to understand what I am doing wrong.
The error implies that shift with a freq is applied to the index of the dataframe, not the values. You need to set the timestamp column as the index after converting it to a datetime data type.
data['timestamp'] = pd.to_datetime(data['timestamp'])
data = data.set_index('timestamp')
week_diff = (data - data.shift(freq='7D')).dropna()
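From there, the three feature columns from the question can be built by looking up each row's date minus an offset against the index. A minimal sketch, assuming the timestamps are unique and the column of interest is named 'value' (that column name is an assumption; substitute your own):
lookup = data['value']
# reindex on "this date minus an offset" pulls the value from that earlier date; NaN where no such row exists
data['same_day_last_week'] = lookup.reindex(data.index - pd.Timedelta(days=7)).to_numpy()
data['same_day_last_month'] = lookup.reindex(data.index - pd.DateOffset(months=1)).to_numpy()
data['same_day_last_year'] = lookup.reindex(data.index - pd.DateOffset(years=1)).to_numpy()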
I have a date column in my dataframe and I am trying to create a new column ('delta_days') that has the difference (in days) between the current row and the previous row.
# Find amount of days difference between dates
for i in df:
new_date = date(df.iloc[i,'date'])
old_date = date(df.iloc[i-1,'date']) if i > 0 else date(df.iloc[0, 'date'])
df.iloc[i,'delta_days'] = new_date - old_date
I am using iloc because I want to directly reference the 'date' column while i represents the current row.
I am getting this error:
ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types
Can someone please help?
You can use pandas.DataFrame.shift method to achieve what you need.
Something more or less like this:
df['prev_date'] = df['date'].shift(1)
df['delta_days'] = df['date'] - df['prev_date']
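The difference comes back as a Timedelta; a small sketch (assuming 'date' is already a datetime column) to get plain integer days and handle the first row:
df['delta_days'] = (df['date'] - df['date'].shift(1)).dt.days
# the first row has no previous date, so fill its NaN with 0 before casting to int
df['delta_days'] = df['delta_days'].fillna(0).astype(int)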
I'm desperately trying to group my data in order to see in which months most people travel, but first I want to remove all the data from before a certain year.
As you can see in the picture, I have data going all the way back to the year 0003, which I do not want to include.
How can I set an interval from 1938-01-01 to 2020-09-21 with pandas and datetime?
My_Code
One way to solve this is:
Verify that the date is in datetime format (it is necessary to convert it first):
df.date_start = pd.to_datetime(df.date_start)
Set date_start as new index:
df.index = df.date_start
Then apply the groupby and aggregate:
df.groupby([pd.Grouper(freq="1M"), "country_code"]) \
    .agg({"Name of the column with frequencies": "sum"})
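To also restrict the data to the 1938-01-01 to 2020-09-21 window before grouping, a label slice on the datetime index works. A small sketch under the setup above:
# the index must be sorted for label-based slicing to be reliable
df = df.sort_index().loc["1938-01-01":"2020-09-21"]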
Boolean indexing with pandas.Series.between
# sample data
df = pd.DataFrame(pd.date_range('1910-01-01', '2020-09-21', periods=10), columns=['Date'])
# boolean indexing with Series.between
new_df = df[df['Date'].between('1938-01-01', '2020-09-21')]
# groupby month and get the count of each group
months = new_df.groupby(new_df['Date'].dt.month).count()
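From there, the busiest month is just a lookup on the grouped counts; a small sketch using the frame built above:
# months is indexed by month number (1-12); idxmax returns the month with the most rows
busiest_month = months['Date'].idxmax()
print(busiest_month)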
I have a dataframe of downsampled Open/High/Low/Last/Change/Volume values for a security over ten years.
I'm trying to get the weekly count of samples, i.e. how many samples my downsampling method (in this case a Volume bar) produced per week over the entire dataset, so that I can plot it and compare it to other downsampling methods.
So far I've tried creating a series in the df called 'Year-Week' following the answers described here and here.
The problem with these answers is that my EOY dates such as '1997-12-30' get transformed to '1997-01' because of the ISO calendar system used as described in this answer, which breaks my results when I apply the value_counts method.
My code is the following:
volumeBar['Year/Week'] = (pd.Series(volumeBar.index).dt.year.astype(str) + "/" + pd.Series(volumeBar.index).dt.week.astype(str)).values
So my question is: as it stands, the following sample DatetimeIndex
Date
1997-12-22
1997-12-29
1997-12-30
becomes
Year/Week
1997/52
1997/1
1997/1
How could I get the following expected result?
Year/Week
1997/52
1997/52
1997/52
Please keep in mind that I cannot manually correct this behavior because of the size of the dataset and the erratic way these results appear due to how the ISO calendar works.
Many thanks in advance!
You can use the function get_years_week below to get years and weeks without ISO formatting.
import pandas as pd
import datetime
a = {'Date': ['1997-11-29', '1997-12-22', '1997-12-29', '1997-12-30']}
data = pd.DataFrame(a)
data['Date'] = pd.to_datetime(data['Date'])
# Function for getting weeks and years
def get_years_week(data):
    # Get year from date
    data['year'] = data['Date'].dt.year
    # Loop over each row of the date column and get the week number:
    # days elapsed since January 1 of that year, integer-divided by 7, plus 1
    for i in range(len(data)):
        jan1 = datetime.datetime(data['Date'][i].year, 1, 1)
        data.loc[i, 'week'] = ((data['Date'][i] - jan1).days // 7) + 1
    # Create a combined year/week column
    data['year/week'] = data['year'].astype('str') + '/' + data['week'].astype(int).astype('str')
    return data
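For larger frames, a vectorized version of the same idea avoids the Python-level loop. A sketch under the same assumption of a 'Date' column of dtype datetime64:
# days since January 1 of each row's own year, integer-divided by 7, plus 1
jan1 = pd.to_datetime(data['Date'].dt.year.astype(str) + '-01-01')
week = ((data['Date'] - jan1).dt.days // 7) + 1
data['year/week'] = data['Date'].dt.year.astype(str) + '/' + week.astype(str)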
I'm new to pandas. I have a dataframe of TTI values sorted by hour of the day across many years. I want to add a new column showing last year's TTI value for each row. I wrote this code:
import pandas as pd
tti = pd.read_csv("c:\\users\\Mehrdad\\desktop\\Hourly_TTI.csv")
tti['new_date'] = pd.to_datetime(tti['Date'])
tti['last_year'] = tti['TTI'].shift(1,freq='1-Jan-2009')
print tti.head(10)
but I don't know how to define the frequency value for shift so that it shifts my data back by one year relative to my first date, which is 01-01-2010.
df['last_year'] = df['date'] - pd.DateOffset(years=1)
# look up the value recorded at last year's date; dates with no match become NaN
df['new_value'] = df.set_index('date')['TTI'].reindex(df['last_year']).to_numpy()
df.shift can only move the data by a fixed distance (a set number of rows, or a fixed frequency on a DatetimeIndex).
Use an offset to build the shifted dates and retrieve the values via that new index. Be aware that the first year has no prior-year values, so those rows come back as NaN.
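Applied to the frame from the question, a minimal sketch (assuming 'new_date' holds unique hourly timestamps and the value column is named 'TTI'):
tti = tti.set_index('new_date').sort_index()
# for each timestamp, look up the TTI recorded exactly one year earlier; dates with no match become NaN
tti['last_year'] = tti['TTI'].reindex(tti.index - pd.DateOffset(years=1)).to_numpy()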
I am trying to filter out some data and seem to be running into some errors.
Below is a replica of the code I have:
url = "http://elections.huffingtonpost.com/pollster/2012-general-election-romney-vs-obama.csv"
source = requests.get(url).text
s = StringIO(source)
election_data = pd.DataFrame.from_csv(s, index_col=None).convert_objects(
convert_dates="coerce", convert_numeric=True)
election_data.head(n=3)
last_day = max(election_data["Start Date"])
filtered = election_data[((last_day-election_data['Start Date']).days <= 5)]
As you can see, last_day is the max of the 'Start Date' column in election_data. I would like to keep only the rows where the difference between that max and the row's 'Start Date' is less than or equal to 5 days.
I have tried using for-loops and various combinations of list comprehensions.
filtered = election_data[map(lambda x: (last_day - x).days <= 5, election_data["Start Date"]) ]
This line would work in Python 2; in Python 3, however, I get the following:
<map object at 0x10798a2b0>
Your first attempt has it almost right. The issue is
(last_day - election_data['Start Date']).days
which should instead be
(last_day - election_data['Start Date']).dt.days
A Series of timedeltas does not have a days attribute directly; you get it through the .dt accessor (a TimedeltaIndex, by contrast, does expose .days). A fully working example is below.
# read the CSV and parse the date columns up front
data = pd.read_csv(url, parse_dates=['Start Date', 'End Date', 'Entry Date/Time (ET)'])
# keep rows whose Start Date is within 5 days of the most recent Start Date
data.loc[(data['Start Date'].max() - data['Start Date']).dt.days <= 5]
Note that I've used Series.max which is more performant than the built-in max. Also, data.loc[mask] is slightly faster than data[mask] since it is less-overloaded (has a more specialized use case).
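An equivalent formulation compares the Timedelta directly and skips the .dt.days step; a small sketch under the same assumptions:
cutoff = pd.Timedelta(days=5)
data.loc[data['Start Date'].max() - data['Start Date'] <= cutoff]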
If I understand your question correctly, you just want to filter your data down to rows whose Start Date value is <= 5 days away from the last day. This sounds like something pandas indexing can easily handle using .loc.
If you want an entirely new DataFrame object with the filtered data:
election_data # your frame
last_day = max(election_data["Start Date"])
date = # Your cutoff date, 5 days before the last day
new_df = election_data.loc[election_data["Start Date"] >= date]
Or if you just want the Start Date column post-filtering:
last_day = max(election_data["Start Date"])
date = # Your cutoff date, 5 days before the last day
filtered_dates = election_data.loc[election_data["Start Date"] >= date, "Start Date"]
Note that your date variable needs to be in the same format as Start Date (possibly YYYY-mm-dd). If you don't know what this value should be, just print(last_day) and count 5 days back.
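If you'd rather compute the cutoff than read it off manually, a short sketch:
# build the cutoff directly from the most recent Start Date
date = last_day - pd.Timedelta(days=5)
new_df = election_data.loc[election_data["Start Date"] >= date]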