How to trim outliers in dates in python? - python

I have a dataframe df:
0 2003-01-02
1 2015-10-31
2 2015-11-01
16 2015-11-02
33 2015-11-03
44 2015-11-04
and I want to trim the outliers in the dates. So in this example I want to delete the row with the date 2003-01-02. Or in bigger data frames I want to delete the dates who do not lie in the interval where 95% or 99% lie. Is there a function who can do this ?

You could use quantile() on Series or DataFrame.
dates = [datetime.date(2003,1,2),
datetime.date(2015,10,31),
datetime.date(2015,11,1),
datetime.date(2015,11,2),
datetime.date(2015,11,3),
datetime.date(2015,11,4)]
df = pd.DataFrame({'DATE': [pd.Timestamp(x) for x in dates]})
print(df)
qa = df['DATE'].quantile(0.1) #lower 10%
qb = df['DATE'].quantile(0.9) #higher 10%
print(qa, qb)
#remove outliers
xf = df[(df['DATE'] >= qa) & (df['DATE'] <= qb)]
print(xf)
The output is:
DATE
0 2003-01-02
1 2015-10-31
2 2015-11-01
3 2015-11-02
4 2015-11-03
5 2015-11-04
2009-06-01 12:00:00 2015-11-03 12:00:00
DATE
1 2015-10-31
2 2015-11-01
3 2015-11-02
4 2015-11-03

Assuming you have your column converted to datetime format:
import pandas as pd
import datetime as dt
df = pd.DataFrame(data)
df = pd.to_datetime(df[0])
you can do:
include = df[df.dt.year > 2003]
print(include)
[out]:
1 2015-10-31
2 2015-11-01
3 2015-11-02
4 2015-11-03
5 2015-11-04
Name: 0, dtype: datetime64[ns]
Have a look here
... regarding to your answer (it's basically the same idea,... be creative my friend):
s = pd.Series(df)
s10 = s.quantile(.10)
s90 = s.quantile(.90)
my_filtered_data = df[df.dt.year >= s10.year]
my_filtered_data = my_filtered_data[my_filtered_data.dt.year <= s90.year]

Related

Pandas change time values based on condition

I have a dataframe:
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
I would like to convert the time based on conditions: if the hour is less than 9, I want to set it to 9 and if the hour is more than 17, I need to set it to 17.
I tried this approach:
df['time'] = np.where(((df['time'].dt.hour < 9) & (df['time'].dt.hour != 0)), dt.time(9, 00))
I am getting an error: Can only use .dt. accesor with datetimelike values.
Can anyone please help me with this? Thanks.
Here's a way to do what your question asks:
df.time = pd.to_datetime(df.time)
df.loc[df.time.dt.hour < 9, 'time'] = (df.time.astype('int64') + (9 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.loc[df.time.dt.hour > 17, 'time'] = (df.time.astype('int64') + (17 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
Input:
time
0 2022-06-06 08:45:00
1 2022-06-06 09:30:00
2 2022-06-06 18:00:00
3 2022-06-06 15:00:00
Output:
time
0 2022-06-06 09:45:00
1 2022-06-06 09:30:00
2 2022-06-06 17:00:00
3 2022-06-06 15:00:00
UPDATE:
Here's alternative code to try to address OP's error as described in the comments:
import pandas as pd
import datetime
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
print('', 'df loaded as strings:', df, sep='\n')
df.time = pd.to_datetime(df.time, format='%H:%M:%S')
print('', 'df converted to datetime by pd.to_datetime():', df, sep='\n')
df.loc[df.time.dt.hour < 9, 'time'] = (df.time.astype('int64') + (9 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.loc[df.time.dt.hour > 17, 'time'] = (df.time.astype('int64') + (17 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.time = [time.time() for time in pd.to_datetime(df.time)]
print('', 'df with time column adjusted to have hour between 9 and 17, converted to type "time":', df, sep='\n')
Output:
df loaded as strings:
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
df converted to datetime by pd.to_datetime():
time
0 1900-01-01 08:45:00
1 1900-01-01 09:30:00
2 1900-01-01 18:00:00
3 1900-01-01 15:00:00
df with time column adjusted to have hour between 9 and 17, converted to type "time":
time
0 09:45:00
1 09:30:00
2 17:00:00
3 15:00:00
UPDATE #2:
To not just change the hour for out-of-window times, but to simply apply 9:00 and 17:00 as min and max times, respectively (see OP's comment on this), you can do this:
df.loc[df['time'].dt.hour < 9, 'time'] = pd.to_datetime(pd.DataFrame({
'year':df['time'].dt.year, 'month':df['time'].dt.month, 'day':df['time'].dt.day,
'hour':[9]*len(df.index)}))
df.loc[df['time'].dt.hour > 17, 'time'] = pd.to_datetime(pd.DataFrame({
'year':df['time'].dt.year, 'month':df['time'].dt.month, 'day':df['time'].dt.day,
'hour':[17]*len(df.index)}))
df['time'] = [time.time() for time in pd.to_datetime(df['time'])]
Since your 'time' column contains strings they can kept as strings and assign new string values where appropriate. To filter for your criteria it is convenient to: create datetime Series from the 'time' column, create boolean Series by comparing the datetime Series with your criteria, use the boolean Series to filter the rows which need to be changed.
Your data:
import numpy as np
import pandas as pd
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
print(df.to_string())
>>>
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
Convert to datetime, make boolean Series with your criteria
dts = pd.to_datetime(df['time'])
lt_nine = dts.dt.hour < 9
gt_seventeen = (dts.dt.hour >= 17)
print(lt_nine)
print(gt_seventeen)
>>>
0 True
1 False
2 False
3 False
Name: time, dtype: bool
0 False
1 False
2 True
3 False
Name: time, dtype: bool
Use the boolean series to assign a new value:
df.loc[lt_nine,'time'] = '09:00:00'
df.loc[gt_seventeen,'time'] = '17:00:00'
print(df.to_string())
>>>
time
0 09:00:00
1 09:30:00
2 17:00:00
3 15:00:00
Or just stick with strings altogether and create the boolean Series using regex patterns and .str.match.
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00','07:22:00','22:02:06']}
dg = pd.DataFrame(data)
print(dg.to_string())
>>>
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
4 07:22:00
5 22:02:06
# regex patterns
pattern_lt_nine = '^00|01|02|03|04|05|06|07|08'
pattern_gt_seventeen = '^17|18|19|20|21|22|23'
Make boolean Series and assign new values
gt_seventeen = dg['time'].str.match(pattern_gt_seventeen)
lt_nine = dg['time'].str.match(pattern_lt_nine)
dg.loc[lt_nine,'time'] = '09:00:00'
dg.loc[gt_seventeen,'time'] = '17:00:00'
print(dg.to_string())
>>>
time
0 09:00:00
1 09:30:00
2 17:00:00
3 15:00:00
4 09:00:00
5 17:00:00
Time series / date functionality
Working with text data

How do I specify certain date as the first week and calculate the week number in pandas?

how to convert time to week number
year_start = '2019-05-21'
year_end = '2020-02-22'
How do I get the week number based on the date that I set as first week?
For example 2019-05-21 should be Week 1 instead of 2019-01-01
If you do not have dates outside of year_start/year_end, use isocalendar().week and perform a simple subtraction with modulo:
year_start = pd.to_datetime('2019-05-21')
#year_end = pd.to_datetime('2020-02-22')
df = pd.DataFrame({'date': pd.date_range('2019-05-21', '2020-02-22', freq='30D')})
df['week'] = (df['date'].dt.isocalendar().week.astype(int)-year_start.isocalendar()[1])%52+1
Output:
date week
0 2019-05-21 1
1 2019-06-20 5
2 2019-07-20 9
3 2019-08-19 14
4 2019-09-18 18
5 2019-10-18 22
6 2019-11-17 26
7 2019-12-17 31
8 2020-01-16 35
9 2020-02-15 39
Try the following code.
import numpy as np
import pandas as pd
year_start = '2019-05-21'
year_end = '2020-02-22'
# Create a sample dataframe
df = pd.DataFrame(pd.date_range(year_start, year_end, freq='D'), columns=['date'])
# Add the week number
df['week_number'] = (((df.date.view(np.int64) - pd.to_datetime([year_start]).view(np.int64)) / (1e9 * 60 * 60 * 24) - df.date.dt.day_of_week + 7) // 7 + 1).astype(np.int64)
date
week_number
2019-05-21
1
2019-05-22
1
2019-05-23
1
2019-05-24
1
2019-05-25
1
2019-05-26
1
2019-05-27
2
2019-05-28
2
2020-02-18
40
2020-02-19
40
2020-02-20
40
2020-02-21
40
2020-02-22
40
If you just need a function to calculate week no, based on given start and end date:
import pandas as pd
import numpy as np
start_date = "2019-05-21"
end_date = "2020-02-22"
start_datetime = pd.to_datetime(start_date)
end_datetime = pd.to_datetime(end_date)
def get_week_no(date):
given_datetime = pd.to_datetime(date)
# if date in range
if start_datetime <= given_datetime <= end_datetime:
x = given_datetime - start_datetime
# adding 1 as it will return 0 for 1st week
return int(x / np.timedelta64(1, 'W')) + 1
raise ValueError(f"Date is not in range {start_date} - {end_date}")
print(get_week_no("2019-05-21"))
In the function, we are calculating week no by finding difference between given date and start date in weeks.

Pandas add n number of new date rows to DataFrame

I want to add a number of months to the end of my dataframe.
What is the best way to append another six (or 12) months to such a dataframe using dates?
0 2013-07-31
1 2013-08-31
2 2013-09-30
3 2013-10-31
4 2013-11-30
Thanks
Edit: I think you might want pd.date_range
df = pd.DataFrame({'date':['2010-01-31', '2010-02-28'], 'x':[1,2]})
df['date'] = pd.to_datetime(df.date)
date x
0 2010-01-31 1
1 2010-02-28 2
Then
df.append(pd.DataFrame({'date': pd.date_range(start=df.date.iloc[-1], periods=6, freq='M', closed='right')}))
date x
0 2010-01-31 1.0
1 2010-02-28 2.0
0 2010-03-31 NaN
1 2010-04-30 NaN
2 2010-05-31 NaN
3 2010-06-30 NaN
4 2010-07-31 NaN
After looking into append and other loop sort of options I created this:
length = df.shape [ 0 ]
add = 12
start = df [ 'month' ].iloc [ 0 ]
count = int ( length + add )
dt = pd.date_range ( start, periods = count, freq = 'M' )
this is the dt I get. It gives the proper ending month days.
DatetimeIndex(['2013-07-31', '2013-08-31', '2013-09-30', '2013-10-31',
'2013-11-30', '2013-12-31', '2014-01-31', '2014-02-28',
'2014-03-31', '2014-04-30', '2014-05-31', '2014-06-30'],
dtype='datetime64[ns]', freq='M')
now I just have to change from the DatetimeIndex.
I hope this is good code. Cheers.

Handle multiple date formats in pandas dataframe

I have a dataframe (imported from Excel) which looks like this:
Date Period
0 2017-03-02 2017-03-01 00:00:00
1 2017-03-02 2017-04-01 00:00:00
2 2017-03-02 2017-05-01 00:00:00
3 2017-03-02 2017-06-01 00:00:00
4 2017-03-02 2017-07-01 00:00:00
5 2017-03-02 2017-08-01 00:00:00
6 2017-03-02 2017-09-01 00:00:00
7 2017-03-02 2017-10-01 00:00:00
8 2017-03-02 2017-11-01 00:00:00
9 2017-03-02 2017-12-01 00:00:00
10 2017-03-02 Q217
11 2017-03-02 Q317
12 2017-03-02 Q417
13 2017-03-02 Q118
14 2017-03-02 Q218
15 2017-03-02 Q318
16 2017-03-02 Q418
17 2017-03-02 2018
I am trying to convert all the 'Period' column into a consistent format. Some elements look already in the datetime format, others are converted to string (ex. Q217), others to int (ex 2018). Which is the fastest way to convert everything in a datetime? I was trying with some masking, like this:
mask = df['Period'].str.startswith('Q', na = False)
list_quarter = df_final[mask]['Period'].tolist()
quarter_convert = {'1':'31/03', '2':'30/06', '3':'31/08', '4':'30/12'}
counter = 0
for element in list_quarter:
element = element[1:]
quarter = element[0]
year = element[1:]
daymonth = ''.join(str(quarter_convert.get(word, word)) for word in quarter)
final = daymonth+'/'+year
list_quarter[counter] = final
counter+=1
However it fails when I try to substitute the modified elements in the original column:
df_nwe_final['Period'] = np.where(mask, pd.Series(list_quarter), df_nwe_final['Period'])
Of course I would need to do more or less the same with the 2018 type formats. However, I am sure I am missing something here, and there should be a much faster solution. Some fresh ideas from you would help! Thank you.
Reusing the code you show, let's first write a function that converts the Q-string to a datetime format (I adjusted to final format a little bit):
def convert_q_string(element):
quarter_convert = {'1':'03-31', '2':'06-30', '3':'08-31', '4':'12-30'}
element = element[1:]
quarter = element[0]
year = element[1:]
daymonth = ''.join(str(quarter_convert.get(word, word)) for word in quarter)
final = '20' + year + '-' + daymonth
return final
We can now use this to first convert all 'Q'-strings, and then pd.to_datetime to convert all elements to proper datetime values:
In [2]: s = pd.Series(['2017-03-01 00:00:00', 'Q217', '2018'])
In [3]: mask = s.str.startswith('Q')
In [4]: s[mask] = s[mask].map(convert_q_string)
In [5]: s
Out[5]:
0 2017-03-01 00:00:00
1 2017-06-30
2 2018
dtype: object
In [6]: pd.to_datetime(s)
Out[6]:
0 2017-03-01
1 2017-06-30
2 2018-01-01
dtype: datetime64[ns]

Start increment column at begining of month

thank you in advance for your assistance.
Want set 'Counter' to 1 whenever there is change in month, and increment by 1 until month changes again, and repeat. Like so:
A Month Counter
2015-10-30 -1.478066 10 21
2015-10-31 -1.562437 10 22
2015-11-01 -0.292285 11 1
2015-11-02 -1.581140 11 2
2015-11-03 0.603113 11 3
2015-11-04 -0.543563 11 4
In [1]: import pandas as pd
import numpy as np
In [2]: dates = pd.date_range('20151030',periods=6)
In [3]: df =pd.DataFrame(np.random.randn(6,1),index=dates,columns=list('A'))
In [4]: df
Out[4]: A
2015-10-30 -1.478066
2015-10-31 -1.562437
2015-11-01 -0.292285
2015-11-02 -1.581140
2015-11-03 0.603113
2015-11-04 -0.543563
Tried this, adds 1 to actual month integer:
In [5]: df['Month'] = df.index.month
In [6]: df['Counter'] df['Counter']=np.where(df['Month'] <> df['Month'], (1), (df['Month'].shift()+1))
In [7]: df
Out[7]: A Month Counter
2015-10-30 -1.478066 10 NaN
2015-10-31 -1.562437 10 11
2015-11-01 -0.292285 11 11
2015-11-02 -1.581140 11 12
2015-11-03 0.603113 11 12
2015-11-04 -0.543563 11 12
Tried datetime, getting closer:
In[8]: from datetime import timedelta
In[9]: df['Counter'] = df.index + timedelta(days=1)
Out[9]: A Month Counter
2015-10-30 -0.478066 11 2015-10-31
2015-10-31 -1.562437 10 2015-11-01
2015-11-01 -0.292285 11 2015-11-02
2015-11-02 -1.581140 11 2015-11-03
2015-11-03 0.603113 11 2015-11-04
2015-11-04 -0.543563 11 2015-11-05
Latter give me the date, but not my counter. New to python, so any help is appreciated. Thank you!
Edit, extending df to periods=300 to include over 12 months of data:
In[10]: dates = pd.date_range('19971002',periods=300)
In[11]: df=pd.DataFrame(np.random.randn(300,1),index=dates,columns=list('A'))
In[12]: df['Counter'] = df.groupby(df.index.month).cumcount()+1
In[13]: df.head()
Out[13] A Counter
1997-09-29 -0.875468 20
1997-09-30 1.498145 21
1997-10-02 0.141262 1
1997-10-03 0.581974 2
1997-10-04 0.581974 3
In[14]: df[250:]
Out[14] A Counter
1998-09-29 -0.875468 20
1998-09-30 1.498145 21
1998-10-01 0.141262 24
1998-10-02 0.581974 25
Desired results:
Out[13] A Counter
1997-09-29 -0.875468 20
1997-09-30 1.498145 21
1997-10-02 0.141262 1
1997-10-03 0.581974 2
1997-10-04 0.581974 3
Code works fine (Out[13] above), seems to be once data goes beyond 12 months counter keeps on incrementing +1 instead of setting back to 1([Out 14] above. Also, getting tricky here, random date generator includes weekend, my data only has weekday data. Hope that helps me help you to help me better. Thank you!
You could use groupby/cumcount to assign a cumulative count to each group:
import pandas as pd
import numpy as np
N = 300
dates = pd.date_range('19971002', periods=N, freq='B')
df = pd.DataFrame(np.random.randn(N, 1),index=dates,columns=list('A'))
df['Counter'] = df.groupby([df.index.year, df.index.month]).cumcount()+1
print(df.loc['1998-09-25':'1998-10-05'])
yields
A Counter
1998-09-25 -0.511721 19
1998-09-28 1.912757 20
1998-09-29 -0.988309 21
1998-09-30 1.277888 22
1998-10-01 -0.579450 1
1998-10-02 -2.486014 2
1998-10-05 0.728789 3

Categories