Pandas add n number of new date rows to DataFrame - python

I want to add a number of months to the end of my dataframe.
What is the best way to append another six (or 12) months to such a dataframe using dates?
0 2013-07-31
1 2013-08-31
2 2013-09-30
3 2013-10-31
4 2013-11-30
Thanks

Edit: I think you might want pd.date_range
df = pd.DataFrame({'date':['2010-01-31', '2010-02-28'], 'x':[1,2]})
df['date'] = pd.to_datetime(df.date)
date x
0 2010-01-31 1
1 2010-02-28 2
Then
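# DataFrame.append is gone in pandas 2.0+, so use pd.concat; [1:] skips the start date, which is already in df
pd.concat([df, pd.DataFrame({'date': pd.date_range(start=df.date.iloc[-1], periods=6, freq='M')[1:]})])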
date x
0 2010-01-31 1.0
1 2010-02-28 2.0
0 2010-03-31 NaN
1 2010-04-30 NaN
2 2010-05-31 NaN
3 2010-06-30 NaN
4 2010-07-31 NaN
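If you need this more than once, the same idea can be wrapped in a small helper. This is only a sketch: the function name `append_month_ends` and its `col` parameter are made up for illustration, and it passes a `pd.offsets.MonthEnd()` object as `freq` so it does not depend on which month-end alias string your pandas version accepts.

```python
import pandas as pd

def append_month_ends(df, n, col='date'):
    """Return a copy of df with n extra month-end rows after the last date.

    Hypothetical helper: assumes df[col] is already datetime64 and sorted.
    """
    last = df[col].iloc[-1]
    # Start at the month end that follows the last existing date.
    new_dates = pd.date_range(
        start=last + pd.offsets.MonthEnd(1),
        periods=n,
        freq=pd.offsets.MonthEnd(),
    )
    # Columns missing from the new rows (here 'x') are filled with NaN.
    return pd.concat([df, pd.DataFrame({col: new_dates})], ignore_index=True)

df = pd.DataFrame({'date': pd.to_datetime(['2010-01-31', '2010-02-28']),
                   'x': [1, 2]})
out = append_month_ends(df, 6)
print(out)
```

With `ignore_index=True` the result gets a fresh 0..n-1 index instead of repeating 0, 1, 2, ... for the appended rows.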

After looking into append and other loop sort of options I created this:
length = df.shape[0]
add = 12
start = df['month'].iloc[0]
count = int(length + add)
dt = pd.date_range(start, periods=count, freq='M')
This is the dt I get. It has the proper month-end days.
DatetimeIndex(['2013-07-31', '2013-08-31', '2013-09-30', '2013-10-31',
'2013-11-30', '2013-12-31', '2014-01-31', '2014-02-28',
'2014-03-31', '2014-04-30', '2014-05-31', '2014-06-30'],
dtype='datetime64[ns]', freq='M')
Now I just have to convert the DatetimeIndex back into a DataFrame column.
I hope this is good code. Cheers.
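One possible way to finish that last step is to build a new frame from the DatetimeIndex and left-merge the original data onto it. This is a sketch under assumptions: the value column `x` is hypothetical, since the question only shows the dates, and `freq` is given as a `pd.offsets.MonthEnd()` object rather than an alias string.

```python
import pandas as pd

# Sample data shaped like the question; 'x' is a made-up value column.
df = pd.DataFrame({
    'month': pd.to_datetime(['2013-07-31', '2013-08-31', '2013-09-30',
                             '2013-10-31', '2013-11-30']),
    'x': [1, 2, 3, 4, 5],
})

add = 12
# Existing months plus `add` new ones, all on month ends.
dt = pd.date_range(df['month'].iloc[0], periods=len(df) + add,
                   freq=pd.offsets.MonthEnd())

# Turn the DatetimeIndex into a column, then pull the old values back in.
extended = pd.DataFrame({'month': dt}).merge(df, on='month', how='left')
print(extended)
```

The new rows carry NaN in every column other than `month`, which matches what the concat-based answer produces.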

Related

How to create a dataframe with pandas.date_range for previous years?

I want to create a dataframe with date from previous years. For example something like this -
df = pd.DataFrame({'Years': pd.date_range('2021-09-21', periods=-5, freq='Y')})
but negative period is not supported. How to achieve that?
Use the end parameter in date_range and then add a DateOffset:
d = pd.to_datetime('2021-09-21')
df = pd.DataFrame({'Years': pd.date_range(end=d, periods=5, freq='Y') +
pd.DateOffset(day=d.day, month=d.month)})
print (df)
Years
0 2016-09-21
1 2017-09-21
2 2018-09-21
3 2019-09-21
4 2020-09-21
Or, if the last value of the column should also be the current year, use YS (start of year):
d = pd.to_datetime('2021-09-21')
df = pd.DataFrame({'Years': pd.date_range(end=d, periods=5, freq='YS') +
pd.DateOffset(day=d.day, month=d.month)})
print (df)
Years
0 2017-09-21
1 2018-09-21
2 2019-09-21
3 2020-09-21
4 2021-09-21
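An alternative that avoids frequency aliases altogether is to subtract whole years from the anchor date directly. This is a sketch, not the answerer's method; `n_years` is a name introduced here for illustration.

```python
import pandas as pd

d = pd.Timestamp('2021-09-21')
n_years = 5  # hypothetical: how many yearly rows to produce

# Subtract 4, 3, 2, 1, 0 whole years so the rows come out in ascending order.
years = [d - pd.DateOffset(years=k) for k in range(n_years - 1, -1, -1)]
df = pd.DataFrame({'Years': years})
print(df)
```

For the date above this gives 2017-09-21 through 2021-09-21. If the anchor were 29 February, `DateOffset(years=...)` would land on 28 February in non-leap years.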

Finding the beginning and end dates of when a sequence of values occurs in Pandas

I have a dataframe with an index column and another column that marks whether or not an event occurred on that day with a 1 or 0.
If an event occurred it typically happened continuously for a prolonged period of time. They'll typically mark whether or not a recession occurred, so it'd likely be 60-180 straight days that would be marked with a 1 before going to 0 again.
What I need to do is find the dates that mark the beginning and end of each sequence of 1's.
Here's some quick sample code:
dates = pd.date_range(start='2010-01-01', end='2015-01-01')
nums = np.random.normal(50, 5, 1827)
df = pd.DataFrame(nums, index=dates, columns=['Nums'])
df['Recession'] = np.where((df.index.month == 3) | (df.index.month == 12), 1, 0)
With the example dataframe, the value 1 occurs for the months of March and December, so ideally I'd have a list that reads [2010-03-01, 2010-03-31, 2010-12-01, 2010-12-31, ......, 2014-12-01, 2014-12-31].
I know I could find these values by using a for-loop, but that seems inefficient. I tried using groupby as well, but couldn't find anything that gave the results that I wanted.
Not sure if there's a pandas or numpy method to search an index for the appropriate conditions or not.
Let's try this, using SeriesGroupBy.idxmin and SeriesGroupBy.idxmax:
# group-by on month, year & aggregate on date
g = (
    df.assign(day=df.index.day)
    .groupby([df.index.month, df.index.year]).day
)
# create mask of max date & min date for each (month, year) combination
mask = df.index.isin(g.idxmin()) | df.index.isin(g.idxmax())
# apply previous mask with month filter..
df.loc[mask & (df.index.month.isin([3,12])), 'Recession'] = 1
print(df[df['Recession'] == 1])
Nums Recession
2010-03-01 45.698168 1.0
2010-03-31 47.969167 1.0
2010-12-01 49.388595 1.0
2010-12-31 46.689064 1.0
2011-03-01 50.120603 1.0
2011-03-31 58.379980 1.0
2011-12-01 53.745407 1.0
...
...
I would use diff to find the periods: diff shows where the series switches from one state to the other, so the positions it finds can be split into two parts, the starts and the ends.
Which half is which depends on whether the data starts with a recession or not:
locs = (df.Recession.diff().fillna(0)!=0).values.nonzero()[0]
if df.Recession.iloc[0]==0:
    start = df.index[locs[::2]]
    end = df.index[locs[1::2]-1]
else:
    # the first switch is 1 -> 0, so the even positions mark ends
    end = df.index[locs[::2]-1]
    start = df.index[locs[1::2]]
If the data already starts in a recession, it is up to you whether to count the first date as a start; the code above does not include it.
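A variant of the same idea compares each row with its neighbours using shift, which avoids the even/odd bookkeeping. This is a sketch on the question's sample data (the random `Nums` column is left out because it is not needed), and the names `flag`, `prev_flag`, `next_flag` and `spans` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start='2010-01-01', end='2015-01-01')
df = pd.DataFrame(index=dates)
df['Recession'] = np.where((df.index.month == 3) | (df.index.month == 12), 1, 0)

flag = df['Recession'].eq(1)
# Treat the rows before the first and after the last as "not in recession",
# so a run touching either edge of the data still gets a start and an end.
prev_flag = flag.shift(1, fill_value=False)
next_flag = flag.shift(-1, fill_value=False)

starts = df.index[(flag & ~prev_flag).to_numpy()]  # 1 preceded by 0
ends = df.index[(flag & ~next_flag).to_numpy()]    # 1 followed by 0
spans = list(zip(starts, ends))
print(spans[:2])
```

Because the edges are filled with False, `starts` and `ends` always have the same length, and a recession that is already under way on the first row counts that row as its start.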
From what I understand, you need to find the first value in each sequence? If so, we can use groupby with cumsum to label each consecutive group, and cumcount to number the rows within each group.
df["keyGroup"] = (
    df.groupby(df["Recession"].ne(df["Recession"].shift()).cumsum()).cumcount() + 1
)
df[df['keyGroup'].eq(1)]
Nums Recession keyGroup
2010-01-01 51.944742 0 1
2010-03-01 54.809271 1 1
2010-04-01 52.632831 0 1
2010-12-01 55.863695 1 1
2011-01-01 52.944778 0 1
2011-03-01 58.164943 1 1
2011-04-01 49.590640 0 1
2011-12-01 47.884919 1 1
2012-01-01 44.128065 0 1
2012-03-01 54.846231 1 1
2012-04-01 51.312064 0 1
2012-12-01 46.091171 1 1
2013-01-01 49.287102 0 1
2013-03-01 54.727874 1 1
2013-04-01 53.163730 0 1
2013-12-01 42.373602 1 1
2014-01-01 43.822791 0 1
2014-03-01 51.203125 1 1
2014-04-01 54.322415 0 1
2014-12-01 44.052536 1 1
2015-01-01 53.438015 0 1
You can call .index to get the dates:
df[df['keyGroup'].eq(1)].index
DatetimeIndex(['2010-01-01', '2010-03-01', '2010-04-01', '2010-12-01',
'2011-01-01', '2011-03-01', '2011-04-01', '2011-12-01',
'2012-01-01', '2012-03-01', '2012-04-01', '2012-12-01',
'2013-01-01', '2013-03-01', '2013-04-01', '2013-12-01',
'2014-01-01', '2014-03-01', '2014-04-01', '2014-12-01',
'2015-01-01'],
dtype='datetime64[ns]', name='date', freq=None)
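The same run labels can also give both ends of each sequence of 1s in one step, by taking the minimum and maximum date per run. This is a sketch that extends the cumsum idea rather than the answerer's own code; `run_id`, `day` and `spans` are names introduced here.

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start='2010-01-01', end='2015-01-01')
df = pd.DataFrame(index=dates)
df['Recession'] = np.where((df.index.month == 3) | (df.index.month == 12), 1, 0)

flag = df['Recession'].eq(1)
# A new run starts every time the value differs from the previous row.
run_id = flag.ne(flag.shift()).cumsum()

# Keep only the recession rows, then take first and last date of each run.
day = df.index.to_series()
spans = day[flag].groupby(run_id[flag]).agg(['min', 'max'])
print(spans.head())
```

Each row of `spans` is one recession, with its first day in the `min` column and its last day in the `max` column.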

Pandas - Add at least one row for every day (datetimes include a time)

Edit: You can use the alleged duplicate's solution with reindex() if your dates don't include times; otherwise you need a solution like the one by @kosnik. In addition, their solution doesn't need your dates to be the index!
I have data formatted like this
df = pd.DataFrame(data=[['2017-02-12 20:25:00', 'Sam', '8'],
['2017-02-15 16:33:00', 'Scott', '10'],
['2017-02-15 16:45:00', 'Steve', '5']],
columns=['Datetime', 'Sender', 'Count'])
df['Datetime'] = pd.to_datetime(df['Datetime'], format='%Y-%m-%d %H:%M:%S')
Datetime Sender Count
0 2017-02-12 20:25:00 Sam 8
1 2017-02-15 16:33:00 Scott 10
2 2017-02-15 16:45:00 Steve 5
I need there to be at least one row for every date, so the expected result would be
Datetime Sender Count
0 2017-02-12 20:25:00 Sam 8
1 2017-02-13 00:00:00 None 0
2 2017-02-14 00:00:00 None 0
3 2017-02-15 16:33:00 Scott 10
4 2017-02-15 16:45:00 Steve 5
I have tried to make datetime the index, add the dates and use reindex() like so
df.index = df['Datetime']
values = df['Datetime'].tolist()
for i in range(len(values)-1):
    if values[i].date() + timedelta < values[i+1].date():
        values.insert(i+1, pd.Timestamp(values[i].date() + timedelta))
print(df.reindex(values, fill_value=0))
This makes every row forget about the other columns and the same thing happens for asfreq('D') or resample()
ID Sender Count
Datetime
2017-02-12 16:25:00 0 Sam 8
2017-02-13 00:00:00 0 0 0
2017-02-14 00:00:00 0 0 0
2017-02-15 20:25:00 0 0 0
2017-02-15 20:25:00 0 0 0
What would be the appropriate way of going about this?
I would create a new DataFrame with a single column that contains all the required datetimes (the existing ones plus one per missing day) and then left join it with your data frame.
A working code example is the following
import datetime
import pandas as pd

df['Datetime'] = pd.to_datetime(df['Datetime'])  # first convert to datetimes
datetimes = df['Datetime'].tolist()  # existing datetimes; the missing days are appended below
dates = [x.date() for x in datetimes]  # the same values reduced to dates
min_date = min(dates)
max_date = max(dates)
for d in range((max_date - min_date).days):
    forward_date = min_date + datetime.timedelta(d)
    if forward_date not in dates:
        datetimes.append(pd.Timestamp(forward_date))  # midnight of the missing day
# create new dataframe in date order, merge and fill 'Count' column with zeroes
df = pd.DataFrame({'Datetime': sorted(datetimes)}).merge(df, on='Datetime', how='left')
df['Count'] = df['Count'].fillna(0)  # assign back instead of inplace on a column
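The loop can also be replaced by set arithmetic on a daily date_range. This is a sketch of that variant, not the answerer's code: it assumes `Count` holds integers (the question stores them as strings), and `days`, `missing`, `filler` and `out` are names introduced here.

```python
import pandas as pd

df = pd.DataFrame(data=[['2017-02-12 20:25:00', 'Sam', 8],
                        ['2017-02-15 16:33:00', 'Scott', 10],
                        ['2017-02-15 16:45:00', 'Steve', 5]],
                  columns=['Datetime', 'Sender', 'Count'])
df['Datetime'] = pd.to_datetime(df['Datetime'])

# Drop the time part to see which calendar days are covered.
days = df['Datetime'].dt.normalize()
all_days = pd.date_range(days.min(), days.max(), freq='D')
missing = all_days.difference(pd.DatetimeIndex(days))

# One placeholder row (midnight, no sender, zero count) per missing day.
filler = pd.DataFrame({'Datetime': missing, 'Sender': None, 'Count': 0})
out = (pd.concat([df, filler], ignore_index=True)
         .sort_values('Datetime')
         .reset_index(drop=True))
print(out)
```

Like the merge-based answer, this keeps every original row with its time intact and does not require the datetimes to be the index.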

How to trim outliers in dates in python?

I have a dataframe df:
0 2003-01-02
1 2015-10-31
2 2015-11-01
16 2015-11-02
33 2015-11-03
44 2015-11-04
and I want to trim the outliers in the dates. So in this example I want to delete the row with the date 2003-01-02. Or, in bigger data frames, I want to delete the dates that do not lie in the interval where 95% or 99% of the dates lie. Is there a function that can do this?
You could use quantile() on Series or DataFrame.
import datetime
import pandas as pd

dates = [datetime.date(2003,1,2),
         datetime.date(2015,10,31),
         datetime.date(2015,11,1),
         datetime.date(2015,11,2),
         datetime.date(2015,11,3),
         datetime.date(2015,11,4)]
df = pd.DataFrame({'DATE': [pd.Timestamp(x) for x in dates]})
print(df)
qa = df['DATE'].quantile(0.1)  # 10th percentile: lower cut-off
qb = df['DATE'].quantile(0.9)  # 90th percentile: upper cut-off
print(qa, qb)
# remove outliers: keep only the rows between the two cut-offs
xf = df[(df['DATE'] >= qa) & (df['DATE'] <= qb)]
print(xf)
The output is:
DATE
0 2003-01-02
1 2015-10-31
2 2015-11-01
3 2015-11-02
4 2015-11-03
5 2015-11-04
2009-06-01 12:00:00 2015-11-03 12:00:00
DATE
1 2015-10-31
2 2015-11-01
3 2015-11-02
4 2015-11-03
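For the 95% or 99% interval mentioned in the question, the same quantile filter can be parameterised. This is a sketch: `trim_dates` and its `lower` and `upper` parameters are hypothetical names, not a pandas function.

```python
import pandas as pd

def trim_dates(s, lower=0.025, upper=0.975):
    """Keep the values of a datetime Series between two quantiles.

    Hypothetical helper; the defaults keep roughly the central 95%.
    """
    lo, hi = s.quantile(lower), s.quantile(upper)
    return s[s.between(lo, hi)]  # both bounds inclusive

dates = pd.Series(pd.to_datetime(['2003-01-02', '2015-10-31', '2015-11-01',
                                  '2015-11-02', '2015-11-03', '2015-11-04']))
trimmed = trim_dates(dates, lower=0.1, upper=0.9)
print(trimmed)
```

Note that a symmetric quantile cut always trims from both ends, so on this tiny sample the latest date is dropped along with the 2003 outlier, just as in the output above.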
Assuming you have your column converted to datetime format:
import pandas as pd
import datetime as dt
df = pd.DataFrame(data)
df = pd.to_datetime(df[0])
you can do:
include = df[df.dt.year > 2003]
print(include)
[out]:
1 2015-10-31
2 2015-11-01
3 2015-11-02
4 2015-11-03
5 2015-11-04
Name: 0, dtype: datetime64[ns]
As for the quantile approach, it is basically the same idea, filtering here on the year of each cut-off:
s = pd.Series(df)
s10 = s.quantile(.10)
s90 = s.quantile(.90)
my_filtered_data = df[df.dt.year >= s10.year]
my_filtered_data = my_filtered_data[my_filtered_data.dt.year <= s90.year]

Resample Dataframe Monthly and store the count for each month in new Dataframe with columns as Date and Count

I want to sample my Dataframe with the date as an index to monthly data and count the instances for each month and then store it in a new dataframe.
Data:
Date Title
2001-05-22 A
2001-05-28 B
2001-06-13 C
2001-06-14 D
2001-06-15 E
2001-07-15 F
2001-07-13 G
2001-07-16 H
2001-07-17 I
. .
. .
. .
2001-12-01 Y
2001-12-31 Z
So the output should look like:
New Dataframe with columns
Date Count
2001-05-31 2
2001-06-30 3
2001-07-31 4
2001-08-31 1
. .
. .
After that, I want to plot the data as a graph (a bar chart, or whichever looks good for such data) with the date on the x-axis.
Note: The data covers a long period (2001-2017), so the x-axis labels shouldn't overlap.
You could use pd.Grouper after you convert Date to datetime format:
Starting from your dataframe:
>>> df
Date Title
0 2001-05-22 A
1 2001-05-28 B
2 2001-06-13 C
3 2001-06-14 D
4 2001-06-15 E
5 2001-07-15 F
6 2001-07-13 G
7 2001-07-16 H
8 2001-07-17 I
9 2001-12-01 Y
10 2001-12-31 Z
Set to datetime and groupby month:
df['Date'] = pd.to_datetime(df['Date'])
df.groupby(pd.Grouper(key='Date', freq='M')).count()  # month end; spelled 'ME' on pandas 2.2+
Output:
Title
Date
2001-05-31 2
2001-06-30 3
2001-07-31 4
2001-08-31 0
2001-09-30 0
2001-10-31 0
2001-11-30 0
2001-12-31 2
To plot, you can use this as a skeleton (I don't really know what you're looking for in a plot):
df['Date'] = pd.to_datetime(df['Date'])
gb = df.groupby(pd.Grouper(key='Date', freq='M')).count()  # month end; spelled 'ME' on pandas 2.2+
import matplotlib.pyplot as plt
plt.bar(gb.index, gb.Title)
plt.ylabel('count')
plt.xticks(rotation=90)
plt.tight_layout()
You said that your DataFrame has the date as its index, so I would use resample in that case:
df.index = pd.to_datetime(df.index)
df.resample('M').count()
Title
Date
2001-05-31 2
2001-06-30 3
2001-07-31 4
2001-08-31 0
2001-09-30 0
2001-10-31 0
2001-11-30 0
2001-12-31 2
To create a plot, use pandas plot
df.resample('M').count().plot()
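To end up with the two-column Date/Count frame the question asks for, one option is to count per monthly period and convert back to month-end dates. This is a sketch using only the rows shown in the question; it goes through `to_period('M')` instead of a resample alias, and `months`, `counts` and `monthly` are names introduced here.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2001-05-22', '2001-05-28', '2001-06-13', '2001-06-14',
             '2001-06-15', '2001-07-15', '2001-07-13', '2001-07-16',
             '2001-07-17', '2001-12-01', '2001-12-31'],
    'Title': list('ABCDEFGHIYZ'),
})
df['Date'] = pd.to_datetime(df['Date'])

# Count rows per calendar month.
months = df['Date'].dt.to_period('M')
counts = df.groupby(months).size()

# Fill in months with no rows so the series has no gaps.
all_months = pd.period_range(months.min(), months.max(), freq='M')
counts = counts.reindex(all_months, fill_value=0)

# Plain DataFrame with the last day of each month and its count.
monthly = pd.DataFrame({
    'Date': counts.index.to_timestamp(how='end').normalize(),
    'Count': counts.to_numpy(),
})
print(monthly)
```

The resulting `monthly` frame can be passed to the plotting code above in place of the grouped result.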
