I am trying to filter out some data and seem to be running into some errors.
Below this statement is a replica of the following code I have:
url = "http://elections.huffingtonpost.com/pollster/2012-general-election-romney-vs-obama.csv"
source = requests.get(url).text
s = StringIO(source)
election_data = pd.DataFrame.from_csv(s, index_col=None).convert_objects(
convert_dates="coerce", convert_numeric=True)
election_data.head(n=3)
last_day = max(election_data["Start Date"])
filtered = election_data[((last_day-election_data['Start Date']).days <= 5)]
As you can see last_day is the max within the column election_data
I would like to filter out the data in which the difference between
the max and x is less than or equal to 5 days
I have tried using for - loops, and various combinations of list comprehension.
filtered = election_data[map(lambda x: (last_day - x).days <= 5, election_data["Start Date"]) ]
This line would normally work however, python3 gives me the following error:
<map object at 0x10798a2b0>
Your first attempt has it almost right. The issue is
(last_day - election_date['Start Date']).days
which should instead be
(last_day - election_date['Start Date']).dt.days
Series objects do not have a days attribute, only TimedeltaIndex objects do. A fully working example is below.
data = pd.read_csv(url, parse_dates=['Start Date', 'End Date', 'Entry Date/Time (ET)'])
data.loc[(data['Start Date'].max() - data['Start Date']).dt.days <= 5]
Note that I've used Series.max which is more performant than the built-in max. Also, data.loc[mask] is slightly faster than data[mask] since it is less-overloaded (has a more specialized use case).
If I understand your question correctly, you just want to filter your data where any Start Date value that is <=5 days away from the last day. This sounds like something pandas indexing could easily handle, using .loc.
If you want an entirely new DataFrame object with the filtered data:
election_data # your frame
last_day = max(election_data["Start Date"])
date = # Your date within 5 days of the last day
new_df = election_data.loc[(last_day-election_data["Start Date"]<=date)]
Or if you just want the Start Date column post-filtering:
last_day = max(election_data["Start Date"])
date = # Your date within 5 days of the last day
filtered_dates = election_data.loc[(last_day-election_data["Start Date"]<=date), "Start Date"]
Note that your date variable needs to be your date in the format required by Start Date (possibly YYYYmmdd format?). If you don't know what this variable should be, then just print(last_day) then count 5 days back.
Related
Trying to use the shift function for Feature Extraction to create 3 additional columns: same day last week, same day last month, same day last year. Data I am using is found here
Initially, I am trying to just use the shift function before creating a new column.
data['timestamp'] = pd.to_datetime(data['timestamp'])
data['year'] = data['timestamp'].dt.year
data['month'] = data['timestamp'].dt.month
data['day'] = data['timestamp'].dt.day
data['day'] = pd.to_datetime(data['day'])
data.info()
the_7_days_diff = data['day'] - data.shift(freq='7D')['day']
Getting an error "This method is only implemented for DatetimeIndex, PeriodIndex and TimedeltaIndex; Got type RangeIndex"
Any help would be appreciated to understand what i am doing wrong.
The error implies that shift is applied on the index of the dataframe, not the value. You need to set the timestamp column as index after converting it to datetime data type.
data['timestamp'] = pd.to_datetime(data['timestamp'])
data = data.set_index('timestamp')
week_diff = (data - data.shift(freq='7D')).dropna()
I have a date column in my dataframe and I am trying to create a new column ('delta_days') that has the difference (in days) between the current row and the previous row.
# Find amount of days difference between dates
for i in df:
new_date = date(df.iloc[i,'date'])
old_date = date(df.iloc[i-1,'date']) if i > 0 else date(df.iloc[0, 'date'])
df.iloc[i,'delta_days'] = new_date - old_date
I am using an iloc because I want to directly reference the 'date' column while i repersents the current row.
I am getting this error:
ValueError: Location based indexing can only have [integer, integer
slice (START point is INCLUDED, END point is EXCLUDED), listlike of
integers, boolean array] types
can someone please help
You can use pandas.DataFrame.shift method to achieve what you need.
Something more or less like this:
df['prev_date'] = df['date'].shift(1)
df['delta_days'] = df['date'] - df['prev_date']
I've been working on a scraping and EDA project on Python3 using Pandas, BeautifulSoup, and a few other libraries and wanted to do some analysis using the time differences between two dates. I want to determine the number of days (or months or even years if that'll make it easier) between the start dates and end dates, and am stuck. I have two columns (air start date, air end date), with dates in the following format: MM-YYYY (so like 01-2021). I basically wanted to make a third column with the time difference between the end and start dates (so I could use it in later analysis).
# split air_dates column into start and end date
dateList = df["air_dates"].str.split("-", n = 1, expand = True)
df['air_start_date'] = dateList[0]
df['air_end_date'] = dateList[1]
df.drop(columns = ['air_dates'], inplace = True)
df.drop(columns = ['rank'], inplace = True)
# changing dates to numerical notation
df['air_start_date'] = pds.to_datetime(df['air_start_date'])
df['air_start_date'] = df['air_start_date'].dt.date.apply(lambda x: x.strftime('%m-%Y') if pds.notnull(x) else npy.NaN)
df['air_end_date'] = pds.Series(df['air_end_date'])
df['air_end_date'] = pds.to_datetime(df['air_end_date'], errors = 'coerce')
df['air_end_date'] = df['air_end_date'].dt.date.apply(lambda x: x.strftime('%m-%Y') if pds.notnull(x) else npy.NaN)
df.isnull().sum()
df.dropna(subset = ['air_end_date'], inplace = True)
def time_diff(time_series):
return datetime.datetime.strptime(time_series, '%d')
df['time difference'] = df['air_end_date'].apply(time_diff) - df['air_start_date'].apply(time_diff)
The last four lines are my attempt at getting a time difference, but I got an error saying 'ValueError: unconverted data remains: -2021'. Any help would be greatly appreciated, as this has had me stuck for a good while now. Thank you!
As far as I can understand, if you have start date and time and end date and time then you can use datetime module in python.
To use this, something like this would be used:
import datetime
# variable = datetime(year, month, day, hour, minute, second)
start = datetime(2017,5,8,18,56,40)
end = datetime(2019,6,27,12,30,58)
print( start - end ) # this will print the difference of these 2 date and time
Hope this answer helps you.
Ok so I figured it out. In my second to last line, I replaced the %d with %m-%Y and now it populates the new column with the number of days between the two dates. I think the format needed to be consistent when running strptime so that's what was causing that error.
here's a slightly cleaned up version; subtract start date from end date to get a timedelta, then take the days attribute from that.
EX:
import pandas as pd
df = pd.DataFrame({'air_dates': ["Apr 2009 - Jul 2010", "not a date - also not a date"]})
df['air_start_date'] = df['air_dates'].str.split(" - ", expand=True)[0]
df['air_end_date'] = df['air_dates'].str.split(" - ", expand=True)[1]
df['air_start_date'] = pd.to_datetime(df['air_start_date'], errors="coerce")
df['air_end_date'] = pd.to_datetime(df['air_end_date'], errors="coerce")
df['timediff_days'] = (df['air_end_date']-df['air_start_date']).dt.days
That will give you for the dummy example
df['timediff_days']
0 456.0
1 NaN
Name: timediff_days, dtype: float64
Regarding calculation of difference in month, you can find some suggestions how to calculate those here. I'd go with #piRSquared's approach:
df['timediff_months'] = ((df['air_end_date'].dt.year - df['air_start_date'].dt.year) * 12 +
(df['air_end_date'].dt.month - df['air_start_date'].dt.month))
df['timediff_months']
0 15.0
1 NaN
Name: timediff_months, dtype: float64
So I have a pandas date_range like so
dates = pd.date_range(start='2005-1-1', end='2014-12-31', freq='D')
I want to remove all the extra days resulting from leap years.
I do a for loop
for each in index:
if each.month==2 and each.day==29:
print(each) # I actually want to delete this item from dates
But my problem is that I don't know how to delete the item. The regular python list methods and functions doesn't work.
I've looked everywhere on SO. I've looked at the documentation for pandas.date_range but found nothing
Any help will be appreciated.
You probably want to use drop to remove the rows.
import pandas as pd
dates = pd.date_range(start='2005-1-1', end='2014-12-31', freq='D')
leap = []
for each in dates:
if each.month==2 and each.day ==29:
leap.append(each)
dates = dates.drop(leap)
You could try creating two Series objects to store the months and days separately and use them as masks.
dates = pd.date_range(start='2005-1-1', end='2014-12-31', freq='D') #All dates between range
days = dates.day #Store all the days
months = dates.month #Store all the months
dates = dates[(days != 29) & (months != 2)] #Filter dates using a mask
Just to check if the approach works, If you change the != condition to ==, we can see the dates you wish to eliminate.
UnwantedDates = dates[(days == 29) & (months == 2)]
Output:
DatetimeIndex(['2008-02-29', '2012-02-29'], dtype='datetime64[ns]', freq=None)
You can try:
dates = dates[~dates['Date'].str.contains('02-29')]
In place of Date you will have to put the name of the column where the dates are stored.
You don't have to use the for loop so it is faster to run.
I have a multi index series. One of the indices is day and I try to go through and get the data in a day range. I have looked into using just rolling with a time given as a string, but it returns a list of the same length whereas I only need 1 response per unique date index.
This is my current code, is there a simpler way to do this:
result = {}
for date in df.index.levels[2]: #this goes through all of the days
pre_date = date - np.timedelta64(window,'D') #find window days ago
cur_df = df.loc[idx[:,:,pre_dat:date],:] #get all data in that window day range
result[date] = f(cur_df)
result = pd.Series(result)