I'm working with the NYC MVA dataset. I've combined the CRASH DATE and CRASH TIME columns into a single column with format 2017-06-26 22:00:00. I'd now like to add a categorical column based on seasons. In order to do this, I'm looking to build a mask for each season and fill the column based on it, using the following basic structure:
df[df['CRASH TIME'].dt.date < dt.date(:,1,2)]
The problem is that datetime.date requires a year argument, and the dataset spans a number of years. I would like to select all years, not any given year. In other words, I'd like to match on just the month and day, ignoring the year. Is there a way to do that using datetime Timestamps?
Assuming you're using pandas to manipulate the data, you could do something like this:
df['day'] = df['CRASH TIME'].dt.day      # day of month for each row
df['month'] = df['CRASH TIME'].dt.month  # month number for each row
Then you can combine them or work with them just as they are.
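For the season column specifically, a minimal sketch building on the month column might look like this (the month-to-season mapping is an assumption here, using meteorological seasons; adjust it to your definition):

# Assumed mapping: Dec-Feb winter, Mar-May spring, Jun-Aug summer, Sep-Nov fall
season_map = {12: 'winter', 1: 'winter', 2: 'winter',
              3: 'spring', 4: 'spring', 5: 'spring',
              6: 'summer', 7: 'summer', 8: 'summer',
              9: 'fall', 10: 'fall', 11: 'fall'}
df['season'] = df['CRASH TIME'].dt.month.map(season_map)

If month granularity is enough for your seasons, this sidesteps the month/day comparison entirely.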
I'm not sure there's a way to directly compare only part of a date, but you can extract the month and day into a tuple and compare tuples instead (shown here for a single Timestamp ts and a plain date d):
month_day_left = (ts.month, ts.day)
month_day_right = (d.month, d.day)
(2, 1) < (2, 2) # True
(1, 10) < (2, 1) # True
(2, 1) < (1, 30) # False
You can then wrap this comparison in a custom function and use it to build a mask:
df[is_earlier(df['CRASH TIME'], some_date)]
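A possible sketch of that helper (is_earlier and some_date are names invented here; it builds the mask row by row with .apply, which is simple but not vectorized):

import datetime as dt

def is_earlier(series, boundary):
    # True where the (month, day) of each timestamp sorts before
    # the (month, day) of `boundary`; years are ignored entirely
    return series.apply(lambda ts: (ts.month, ts.day) < (boundary.month, boundary.day))

# e.g. all rows falling before Feb 2, in any year
subset = df[is_earlier(df['CRASH TIME'], dt.date(2000, 2, 2))]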
Assuming that I have a series made of daily values:
import numpy as np
import pandas as pd

dates = pd.date_range('1/1/2004', periods=365, freq="D")
ts = pd.Series(np.random.randint(0, 101, 365), index=dates)
I need to use .groupby or .reduce with a fixed schema of dates.
Using ts.resample('8D') isn't an option: the breakpoints must not drift within the month, and the last chunk of each month needs to be flexible to absorb the different month lengths, including February in leap years.
A list of dates can be obtained through:
g = dates[dates.day.isin([1,8,16,24])]
How can I group or reduce my data to this specific schema so I can compute the sum, max, and min more elegantly and efficiently than:
for i in range(0, len(g) - 1):
    ts.loc[(g[i] < ts.index) & (ts.index < g[i + 1])]
Well, from a calendar point of view you can group them by calendar week, day of week, month, and so on. If that is something you would be interested in, you can do it easily with pandas, for example:
df['week'] = df['date'].dt.isocalendar().week  # create week column
df.groupby(['week'])['values'].sum()           # sum values by week
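And if you need the exact 1/8/16/24 schema from the question rather than calendar weeks, one possible sketch bins the day of month with pd.cut (the bin edges and labels are assumptions based on the question):

import numpy as np
import pandas as pd

dates = pd.date_range('1/1/2004', periods=365, freq='D')
ts = pd.Series(np.random.randint(0, 101, 365), index=dates)

# Fixed chunks within each month: days 1-7, 8-15, 16-23, 24-end
chunk = pd.cut(ts.index.day, bins=[0, 7, 15, 23, 31],
               labels=['01', '08', '16', '24'])

# Grouping by (month, chunk) lets the last chunk absorb the
# varying month lengths, leap years included
result = ts.groupby([ts.index.to_period('M'), chunk]).agg(['sum', 'max', 'min'])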
I've got a DataFrame that looks like this:
It has two columns, one of them being a "from" datetime and one of them being a "to" datetime. I would like to change this DataFrame such that it has a single column or index for the date (e.g. 2015-07-06 00:00:00 in datetime form) with the variables of the other columns (like deep) split proportionately into each of the days. How might one approach this problem? I've meddled with groupby tricks and I'm not sure how to proceed.
So I don't have time to work through your specific problem at the moment, but the way to approach this is to use pandas .resample(). Here are the steps I would take: 1) resample your "to" date column by minute; 2) populate the other columns out over that resample; 3) add the date column back in as an index.
If this doesn't work or is tricky to work with, I would create a date range from your earliest date to your latest date (at the smallest interval you want, so maybe hourly?) and then run some conditional statements over your other columns to fill in the data.
Here is roughly what your code might look like for the resample portion (replace day with hour or whatever):
drange = pd.date_range('01-01-1970', '01-20-2018', freq='D')  # full range for the fallback approach
data = data.resample('D').ffill()  # needs a DatetimeIndex; forward-fills the other columns
data.index.name = 'date'
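For the fallback approach, here is a possible sketch (the from, to, and deep column names are assumptions from the question; it splits each interval's value evenly over the hours it covers, then re-aggregates by day):

import pandas as pd

# Hypothetical two-interval frame shaped like the question's data
df = pd.DataFrame({'from': pd.to_datetime(['2015-07-06 10:00', '2015-07-07 22:00']),
                   'to':   pd.to_datetime(['2015-07-07 02:00', '2015-07-08 04:00']),
                   'deep': [16.0, 6.0]})

pieces = []
for _, row in df.iterrows():
    hours = pd.date_range(row['from'], row['to'], freq='h')  # hours covered by this interval
    pieces.append(pd.Series(row['deep'] / len(hours), index=hours))  # even split per hour

daily = pd.concat(pieces).sort_index().resample('D').sum()  # back to one row per day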
Hope this helps!
So I have a pandas date_range like so
dates = pd.date_range(start='2005-1-1', end='2014-12-31', freq='D')
I want to remove all the extra days resulting from leap years.
I do a for loop
for each in dates:
    if each.month == 2 and each.day == 29:
        print(each)  # I actually want to delete this item from dates
But my problem is that I don't know how to delete the item. The regular Python list methods and functions don't work.
I've looked everywhere on SO, and I've looked at the documentation for pandas.date_range, but found nothing.
Any help will be appreciated.
You probably want to use drop to remove the rows.
import pandas as pd

dates = pd.date_range(start='2005-1-1', end='2014-12-31', freq='D')
leap = []
for each in dates:
    if each.month == 2 and each.day == 29:
        leap.append(each)
dates = dates.drop(leap)
You could try storing the days and months separately and using them to build a mask:
dates = pd.date_range(start='2005-1-1', end='2014-12-31', freq='D')  # all dates in the range
days = dates.day      # store all the days
months = dates.month  # store all the months
dates = dates[~((days == 29) & (months == 2))]  # filter out only Feb 29
Note the combined condition: masking with (days != 29) & (months != 2) would wrongly drop every 29th of every month plus all of February. Just to check that the approach works, flip the mask (before filtering) to see the dates you wish to eliminate:
UnwantedDates = dates[(days == 29) & (months == 2)]
Output:
DatetimeIndex(['2008-02-29', '2012-02-29'], dtype='datetime64[ns]', freq=None)
You can try:
df = df[~df['Date'].astype(str).str.contains('02-29')]
In place of Date, put the name of the column where the dates are stored; the .astype(str) is needed when the column holds datetimes rather than strings. This avoids the explicit for loop, so it is faster to run.
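A quick demo with a hypothetical Date column:

import pandas as pd

df = pd.DataFrame({'Date': pd.date_range('2012-02-27', '2012-03-01', freq='D')})
df = df[~df['Date'].astype(str).str.contains('02-29')]  # drops 2012-02-29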
I am trying to filter out some data and seem to be running into some errors.
Below is a replica of the code I have:
import requests
import pandas as pd
from io import StringIO

url = "http://elections.huffingtonpost.com/pollster/2012-general-election-romney-vs-obama.csv"
source = requests.get(url).text
s = StringIO(source)
# note: from_csv/convert_objects come from older pandas versions
election_data = pd.DataFrame.from_csv(s, index_col=None).convert_objects(
    convert_dates="coerce", convert_numeric=True)
election_data.head(n=3)
last_day = max(election_data["Start Date"])
filtered = election_data[((last_day - election_data['Start Date']).days <= 5)]
As you can see, last_day is the max of the Start Date column. I would like to filter the data down to rows where the difference between that max and the row's Start Date is less than or equal to 5 days. I have tried for loops and various list comprehensions.
filtered = election_data[map(lambda x: (last_day - x).days <= 5, election_data["Start Date"]) ]
This line would normally work; in Python 3, however, map returns a lazy iterator rather than a list, so instead of a filtered frame all I get is:
<map object at 0x10798a2b0>
Your first attempt has it almost right. The issue is
(last_day - election_data['Start Date']).days
which should instead be
(last_day - election_data['Start Date']).dt.days
Series objects do not have a days attribute, only TimedeltaIndex objects do. A fully working example is below.
import pandas as pd

# url as defined in the question
data = pd.read_csv(url, parse_dates=['Start Date', 'End Date', 'Entry Date/Time (ET)'])
data.loc[(data['Start Date'].max() - data['Start Date']).dt.days <= 5]
Note that I've used Series.max which is more performant than the built-in max. Also, data.loc[mask] is slightly faster than data[mask] since it is less-overloaded (has a more specialized use case).
If I understand your question correctly, you just want to keep the rows whose Start Date value is <= 5 days away from the last day. This sounds like something pandas indexing can handle easily, using .loc.
If you want an entirely new DataFrame object with the filtered data:
election_data # your frame
last_day = election_data["Start Date"].max()
cutoff = pd.Timedelta(days=5)  # keep rows within 5 days of the last day
new_df = election_data.loc[(last_day - election_data["Start Date"]) <= cutoff]
Or if you just want the Start Date column post-filtering:
last_day = election_data["Start Date"].max()
cutoff = pd.Timedelta(days=5)
filtered_dates = election_data.loc[(last_day - election_data["Start Date"]) <= cutoff, "Start Date"]
Note that the subtraction produces timedeltas, so the cutoff has to be a pd.Timedelta rather than a plain date. If you'd rather work with a date, just print(last_day), count 5 days back, and compare Start Date against that date directly.
I have a date given that may or may not be a trading day, and I have a pandas dataframe indexed by trading days that has returns of each trading day.
This is my date
dt_query = datetime.datetime(2006, 12, 31, 16)
And I want to do something like this (returns is a pandas dataframe)
returns.loc[pd.Timestamp(dt_query + datetime.timedelta(days=1))]
However, that may not work as one day ahead may or may not be a trading day. I could make a try block that loops and tries until we find something, but I'm wondering if there's a more elegant way that just uses pandas.
This might not be the most elegant solution, but it works.
Here's the idea: starting from any date dt_query, there must be at least one trading day within some number of calendar days (say 10), and your next trading day is simply the first of them. So you can find all days in returns between dt_query and dt_query + timedelta(days=10), and then take the first one.
Using your example, it should look like
next_trading_date = returns.index[(returns.index > dt_query) & (returns.index <= dt_query + timedelta(days = 10))][0]
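A possibly more direct alternative, if you want to skip building the mask: DatetimeIndex.searchsorted finds the insertion point in a sorted index (a sketch, assuming returns.index is sorted ascending):

import pandas as pd

pos = returns.index.searchsorted(pd.Timestamp(dt_query), side='right')
next_trading_date = returns.index[pos]  # first entry strictly after dt_query
# note: raises IndexError if dt_query falls after the last date in the index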
You can compute the timedelta against the whole index at once:
delta = returns.index - dt_query
then use np.timedelta64 to define a tolerance for selecting the rows you want:
tol = np.timedelta64(2, 'D')  # np.timedelta64 takes (value, unit), not keyword arguments
and:
returns[(delta > np.timedelta64(0, 'D')) & (delta < tol)]
will return the rows within the desired range...
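From there, the next trading day's row is just the first one that survives the mask, e.g.:

next_day = returns[(delta > np.timedelta64(0, 'D')) & (delta < tol)].iloc[0]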
Thank you! That's been plaguing me for hours.
I altered it a bit:
try:
    date_check = dja[start_day]
except KeyError:
    print("Start date not a trading day, fetching next trading day...")
    # dja.index.searchsorted(start_day) would find the position too;
    # here we just take the first index entry after start_day
    next_date = dja.index[dja.index > start_day]
    start_date = next_date[0]
    print("New date:", start_date)