How to delete a date from pandas date_range - python

So I have a pandas date_range like so
dates = pd.date_range(start='2005-1-1', end='2014-12-31', freq='D')
I want to remove all the extra days resulting from leap years.
I do a for loop
for each in dates:
    if each.month == 2 and each.day == 29:
        print(each)  # I actually want to delete this item from dates
But my problem is that I don't know how to delete the item. The regular Python list methods and functions don't work.
I've looked everywhere on SO. I've looked at the documentation for pandas.date_range but found nothing.
Any help will be appreciated.

You probably want to use drop to remove the rows.
import pandas as pd
dates = pd.date_range(start='2005-1-1', end='2014-12-31', freq='D')
leap = []
for each in dates:
    if each.month == 2 and each.day == 29:
        leap.append(each)
dates = dates.drop(leap)

You could store the days and months separately and combine them into a boolean mask.
dates = pd.date_range(start='2005-1-1', end='2014-12-31', freq='D')  # All dates in the range
days = dates.day  # Day of the month for each date
months = dates.month  # Month for each date
is_leap_day = (days == 29) & (months == 2)  # True only for 29 February
UnwantedDates = dates[is_leap_day]  # The dates you wish to eliminate
dates = dates[~is_leap_day]  # Keep everything else
Note that the two conditions have to be combined before negating: (days != 29) & (months != 2) would also drop every day in February and the 29th of every other month. To check that the approach works, print UnwantedDates.
Output:
DatetimeIndex(['2008-02-29', '2012-02-29'], dtype='datetime64[ns]', freq=None)
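A minimal runnable sketch of the mask approach, assuming the goal is to drop only 29 February. The names is_leap_day and removed are introduced here for illustration; the two conditions are combined first and the result is negated as a whole.

```python
import pandas as pd

dates = pd.date_range(start='2005-1-1', end='2014-12-31', freq='D')

# True only where the date is 29 February
is_leap_day = (dates.month == 2) & (dates.day == 29)

removed = dates[is_leap_day]   # the leap days that fall inside the range
dates = dates[~is_leap_day]    # every other date is kept

print(removed)
print(len(dates))
```

Because the mask is computed on the whole index at once, no Python-level loop is needed.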

If the dates are stored as strings in a DataFrame column (rather than in a DatetimeIndex), you can try:
df = df[~df['Date'].str.contains('02-29')]
In place of Date you will have to put the name of the column where the dates are stored.
You don't have to use a for loop, so it is faster to run.
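For a DataFrame column that already has a datetime dtype, the same filter could be sketched with the .dt accessor instead of string matching. The frame below and its column name Date are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with a datetime column spanning a leap day
df = pd.DataFrame({'Date': pd.date_range('2008-02-27', '2008-03-02', freq='D')})

# .dt exposes the month and day of every value in the column
is_leap_day = (df['Date'].dt.month == 2) & (df['Date'].dt.day == 29)
df = df[~is_leap_day]

print(df)
```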

Related

How to slice a pandas DataFrame between two dates (day/month) ignoring the year?

I want to filter a pandas DataFrame with a DatetimeIndex spanning multiple years, keeping the rows between the 15th of April and the 16th of September of each year. Afterwards I want to set a value using the mask.
I was hoping for a function similar to between_time(), but this doesn't exist.
My actual solution is a loop over the unique years.
Minimal Example
import pandas as pd
df = pd.DataFrame({'target':0}, index=pd.date_range('2020-01-01', '2022-01-01', freq='H'))
start_date = "04-15"
end_date = "09-16"
for year in df.index.year.unique():
    # normal approach
    # df[f'{year}-{start_date}':f'{year}-{end_date}'] = 1
    # similar approach, slightly faster
    df.iloc[df.index.get_loc(f'{year}-{start_date}'):df.index.get_loc(f'{year}-{end_date}')+1]=1
Does a solution exist where I can avoid the loop and maybe improve the performance?
To get the dates between April 1st and October 31st, what about using the month?
df.loc[df.index.month.isin(range(4, 11)), 'target'] = 1
If you want to match any date/time while ignoring the year, you can replace the year with 2000 (a leap year) and use:
s = pd.to_datetime(df.index.strftime('2000-%m-%d'))
df.loc[(s >= '2000-04-15') & (s <= '2000-09-16'), 'target'] = 1
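As an alternative to rewriting the year, one possible sketch encodes month and day as a single integer and compares that. This is a swapped-in technique rather than the method from the answer above, and the name month_day is an assumption made for illustration.

```python
import pandas as pd

idx = pd.date_range('2020-01-01', '2022-01-01', freq='h')
df = pd.DataFrame({'target': 0}, index=idx)

# Encode month and day as one integer, e.g. 15 April -> 415, 16 September -> 916
month_day = df.index.month * 100 + df.index.day

# Inclusive on both ends, for every year in the index
mask = (month_day >= 415) & (month_day <= 916)
df.loc[mask, 'target'] = 1

print(df['target'].sum())
```

The time of day plays no part in the comparison, so every hour of 15 April and of 16 September is included.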

How to sort dates imported from a CSV file?

I'm trying to write a program that prints a list of sorted dates, but it keeps sorting by the day instead of the full date (day, month, year).
I'm very new to Python, so there's probably a lot I'm doing wrong, but any help would be greatly appreciated.
I have it set up so that you can view the list over two pages.
The dates sort like this:
12/03/2004
13/08/2001
15/10/2014
but I need the full date sorted
df = pd.read_csv('Employee.csv')
df = df.sort_values('Date of Employment.')
List1 = df.iloc[:50, 1:]
List2 = df.iloc[50:99, 1:]
The datetime data type has to be used for the dates to be sorted correctly.
You need to use one of these approaches to convert the dates to datetime objects (dayfirst=True is passed because the dates are written day/month/year):
Approach 1
pd.to_datetime + DataFrame.sort_values:
df['Date of Employment.'] = pd.to_datetime(df['Date of Employment.'], dayfirst=True)
Approach 2
You can parse the dates at the same time that the pandas DataFrame is being loaded:
df = pd.read_csv('Employee.csv', parse_dates=['Date of Employment.'], dayfirst=True)
This is equivalent to the first approach, except that everything is done in one step.
Next you need to sort the datetime values in either ascending or descending order.
Ascending:
`df.sort_values('Date of Employment.')`
Descending:
`df.sort_values('Date of Employment.',ascending=False)`
You need to convert Date of Employment. to a datetime before sorting:
df['Date of Employment.'] = pd.to_datetime(df['Date of Employment.'], format='%d/%m/%Y')
Otherwise the values are just strings to Python and are sorted character by character.
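A small self-contained sketch of the conversion followed by the sort, using the three dates from the question in an in-memory frame instead of Employee.csv (the frame itself is an assumption made so the example runs on its own):

```python
import pandas as pd

# Stand-in for the CSV data: day/month/year strings
df = pd.DataFrame({'Date of Employment.': ['15/10/2014', '12/03/2004', '13/08/2001']})

# Convert the strings to real datetimes, stating the format explicitly
df['Date of Employment.'] = pd.to_datetime(df['Date of Employment.'], format='%d/%m/%Y')

# Sorting now follows chronological order instead of string order
df = df.sort_values('Date of Employment.').reset_index(drop=True)

print(df)
```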

Python - Group Dates by Month

Here's a quick problem that I, at first, dismissed as easy. An hour in, and I'm not so sure!
So, I have a list of Python datetime objects, and I want to graph them. The x-values are the year and month, and the y-values would be the amount of date objects in this list that happened in this month.
Perhaps an example will demonstrate this better (dd/mm/yyyy):
[28/02/2018, 01/03/2018, 16/03/2018, 17/05/2018]
-> ([02/2018, 03/2018, 04/2018, 05/2018], [1, 2, 0, 1])
My first attempt tried to simply group by month and year, along the lines of:
import itertools
group = itertools.groupby(dates, lambda date: date.strftime("%b/%Y"))
graph = zip(*[(k, len(list(v))) for k, v in group])  # format the data for graphing
As you've probably noticed though, this will group only by dates that are already present in the list. In my example above, the fact that none of the dates occurred in April would have been overlooked.
Next, I tried finding the starting and ending dates, and looping over the months between them:
import datetime
data = [[], []]
for year in range(min_date.year, max_date.year):
    for month in range(min_date.month, max_date.month):
        k = datetime.datetime(year=year, month=month, day=1).strftime("%b/%Y")
        v = sum([1 for date in dates if date.strftime("%b/%Y") == k])
        data[0].append(k)
        data[1].append(v)
Of course, this only works if min_date.month is smaller than max_date.month, which is not necessarily the case if they span multiple years. Also, it's pretty ugly.
Is there an elegant way of doing this?
Thanks in advance
EDIT: To be clear, the dates are datetime objects, not strings. They look like strings here for the sake of being readable.
I suggest using pandas:
import pandas as pd
dates = ['28/02/2018', '01/03/2018', '16/03/2018', '17/05/2018']
s = pd.to_datetime(pd.Series(dates), format='%d/%m/%Y')
s.index = s.dt.to_period('M')
s = s.groupby(level=0).size()
s = s.reindex(pd.period_range(s.index.min(), s.index.max(), freq='M'), fill_value=0)
print(s)
2018-02    1
2018-03    2
2018-04    0
2018-05    1
Freq: M, dtype: int64
s.plot.bar()
Explanation:
First create a Series from the list of dates and convert it with to_datetime.
Create a PeriodIndex with Series.dt.to_period.
Group by the index (level=0) and get the counts with GroupBy.size.
Add the missing periods with Series.reindex, using a PeriodIndex built from the min and max values of the index.
Finally plot, e.g. as bars with Series.plot.bar.
Using Counter:
import random
import collections
dates = list()
for y in range(2015, 2019):
    for m in range(1, 13):
        for i in range(random.randint(1, 4)):
            dates.append("{}/{}".format(m, y))
print(dates)
counter = collections.Counter(dates)
print(counter)
For dates with no occurrences, you can use the subtract and update methods of Counter.
Generate a list covering the whole range of dates, in which each date appears only once. Subtracting that list creates an entry for every month (missing months get -1), and adding the same list back with update restores the original counts while leaving missing months at 0,
like so:
tmp_date_list = ["{}/{}".format(m, y) for y in range(2015, 2019) for m in range(1, 13)]
counter.subtract(tmp_date_list)
counter.update(tmp_date_list)
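For the datetime objects described in the question, a plain-Python sketch could count by (year, month) and then walk month by month from the earliest to the latest date, so that empty months show up as zero. The variable names are assumptions made for illustration.

```python
import datetime
from collections import Counter

dates = [
    datetime.datetime(2018, 2, 28),
    datetime.datetime(2018, 3, 1),
    datetime.datetime(2018, 3, 16),
    datetime.datetime(2018, 5, 17),
]

# Count how many dates fall in each (year, month) pair
counts = Counter((d.year, d.month) for d in dates)

first, last = min(dates), max(dates)
labels, values = [], []
year, month = first.year, first.month

# Step through every month in the span, including those with no dates
while (year, month) <= (last.year, last.month):
    labels.append("{:02d}/{}".format(month, year))
    values.append(counts[(year, month)])  # a Counter returns 0 for missing keys
    month += 1
    if month > 12:
        year, month = year + 1, 1

print(labels)
print(values)
```

Comparing (year, month) tuples handles spans that cross a year boundary, which the nested range loops in the question do not.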

Mapping Values in a pandas Dataframe column?

I am trying to filter out some data and seem to be running into some errors.
Below is a replica of the code I have:
url = "http://elections.huffingtonpost.com/pollster/2012-general-election-romney-vs-obama.csv"
source = requests.get(url).text
s = StringIO(source)
election_data = pd.DataFrame.from_csv(s, index_col=None).convert_objects(
    convert_dates="coerce", convert_numeric=True)
election_data.head(n=3)
last_day = max(election_data["Start Date"])
filtered = election_data[((last_day-election_data['Start Date']).days <= 5)]
As you can see, last_day is the max of the Start Date column in election_data.
I would like to filter the data down to the rows in which the difference between
the max and x is less than or equal to 5 days.
I have tried using for loops and various combinations of list comprehensions.
filtered = election_data[map(lambda x: (last_day - x).days <= 5, election_data["Start Date"]) ]
This line would normally work; however, Python 3 gives me the following error:
<map object at 0x10798a2b0>
Your first attempt has it almost right. The issue is
(last_day - election_data['Start Date']).days
which should instead be
(last_day - election_data['Start Date']).dt.days
Series objects do not have a days attribute, only Timedelta and TimedeltaIndex objects do; for a Series of timedeltas it is reached through the .dt accessor. A fully working example is below.
data = pd.read_csv(url, parse_dates=['Start Date', 'End Date', 'Entry Date/Time (ET)'])
data.loc[(data['Start Date'].max() - data['Start Date']).dt.days <= 5]
Note that I've used Series.max which is more performant than the built-in max. Also, data.loc[mask] is slightly faster than data[mask] since it is less-overloaded (has a more specialized use case).
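Since the URL may no longer be reachable, here is a hedged sketch of the same filter on a small made-up frame. The rows and the Value column are invented for illustration; only the Start Date column name comes from the question.

```python
import pandas as pd

# Made-up stand-in for the poll data
data = pd.DataFrame({
    'Start Date': pd.to_datetime(['2012-10-25', '2012-10-31', '2012-11-03', '2012-11-05']),
    'Value': [47, 48, 49, 50],
})

last_day = data['Start Date'].max()

# Subtracting a Series from a Timestamp gives a Series of timedeltas;
# .dt.days turns each one into a whole number of days
recent = data.loc[(last_day - data['Start Date']).dt.days <= 5]

print(recent)
```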
If I understand your question correctly, you just want to keep the rows whose Start Date value is <= 5 days away from the last day. This sounds like something pandas indexing could easily handle, using .loc.
If you want an entirely new DataFrame object with the filtered data:
election_data # your frame
last_day = max(election_data["Start Date"])
max_gap = pd.Timedelta(days=5) # Largest allowed distance from the last day
new_df = election_data.loc[(last_day-election_data["Start Date"]<=max_gap)]
Or if you just want the Start Date column post-filtering:
last_day = max(election_data["Start Date"])
max_gap = pd.Timedelta(days=5) # Largest allowed distance from the last day
filtered_dates = election_data.loc[(last_day-election_data["Start Date"]<=max_gap), "Start Date"]
Note that the subtraction yields timedeltas, so the threshold needs to be a Timedelta rather than a date, and Start Date has to be a datetime column for the subtraction to work.

Python Pandas: Resample date range

I create a list of pandas datetimes with the following line:
range = pd.date_range(start = '5/3/2005', periods =5+1, freq='1D')
Is there a quick way to resample that list so that it spans the same range but with a frequency of 30 min? (so far I can only see applications of that logic to Series or DataFrames index, but no daterange)
One way is:
date_range = pd.date_range(start = '5/3/2005', periods =5+1, freq='1D')
new_date_range = date_range.to_series().asfreq('30 min').index
Also, range is a builtin function, so I would not call the variable "range".
I hope this helps.
You could also do
date_range = pd.date_range(start = '5/3/2005', periods =5+1, freq='1D')
new_date_range = pd.date_range(date_range.min(), date_range.max(), freq='30 min')
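A quick sketch to check that both suggestions produce the same 30-minute index. The names via_asfreq and via_date_range are assumptions made for illustration.

```python
import pandas as pd

date_range = pd.date_range(start='5/3/2005', periods=5 + 1, freq='1D')

# Approach 1: go through a Series and change its frequency
via_asfreq = date_range.to_series().asfreq('30min').index

# Approach 2: build a new range between the same endpoints
via_date_range = pd.date_range(date_range.min(), date_range.max(), freq='30min')

# 5 full days * 48 half-hours, plus the final endpoint
print(len(via_date_range))
print(via_asfreq.equals(via_date_range))
```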
