Thank you in advance for your assistance.
I want to set 'Counter' to 1 whenever the month changes, then increment it by 1 until the month changes again, and repeat. Like so:
A Month Counter
2015-10-30 -1.478066 10 21
2015-10-31 -1.562437 10 22
2015-11-01 -0.292285 11 1
2015-11-02 -1.581140 11 2
2015-11-03 0.603113 11 3
2015-11-04 -0.543563 11 4
In [1]: import pandas as pd
import numpy as np
In [2]: dates = pd.date_range('20151030',periods=6)
In [3]: df = pd.DataFrame(np.random.randn(6,1), index=dates, columns=list('A'))
In [4]: df
Out[4]: A
2015-10-30 -1.478066
2015-10-31 -1.562437
2015-11-01 -0.292285
2015-11-02 -1.581140
2015-11-03 0.603113
2015-11-04 -0.543563
I tried this, but it adds 1 to the actual month integer:
In [5]: df['Month'] = df.index.month
In [6]: df['Counter'] = np.where(df['Month'] != df['Month'], 1, df['Month'].shift() + 1)
In [7]: df
Out[7]: A Month Counter
2015-10-30 -1.478066 10 NaN
2015-10-31 -1.562437 10 11
2015-11-01 -0.292285 11 11
2015-11-02 -1.581140 11 12
2015-11-03 0.603113 11 12
2015-11-04 -0.543563 11 12
I tried datetime and got closer:
In[8]: from datetime import timedelta
In[9]: df['Counter'] = df.index + timedelta(days=1)
Out[9]: A Month Counter
2015-10-30 -1.478066 10 2015-10-31
2015-10-31 -1.562437 10 2015-11-01
2015-11-01 -0.292285 11 2015-11-02
2015-11-02 -1.581140 11 2015-11-03
2015-11-03 0.603113 11 2015-11-04
2015-11-04 -0.543563 11 2015-11-05
The latter gives me the date, but not my counter. I'm new to Python, so any help is appreciated. Thank you!
Edit: extending df to periods=300 to include over 12 months of data:
In[10]: dates = pd.date_range('19971002',periods=300)
In[11]: df=pd.DataFrame(np.random.randn(300,1),index=dates,columns=list('A'))
In[12]: df['Counter'] = df.groupby(df.index.month).cumcount()+1
In[13]: df.head()
Out[13] A Counter
1997-09-29 -0.875468 20
1997-09-30 1.498145 21
1997-10-02 0.141262 1
1997-10-03 0.581974 2
1997-10-04 0.581974 3
In[14]: df[250:]
Out[14] A Counter
1998-09-29 -0.875468 20
1998-09-30 1.498145 21
1998-10-01 0.141262 24
1998-10-02 0.581974 25
Desired results (counter should reset when October comes around again):
Out[14] A Counter
1998-09-29 -0.875468 20
1998-09-30 1.498145 21
1998-10-01 0.141262 1
1998-10-02 0.581974 2
The code works fine for the first 12 months (Out[13] above), but once the data goes beyond 12 months the counter keeps incrementing instead of resetting to 1 (Out[14] above). Also, getting tricky here: the random date generator includes weekends, but my data only has weekday data. Hope that helps you help me. Thank you!
You could use groupby/cumcount to assign a cumulative count to each group:
import pandas as pd
import numpy as np
N = 300
dates = pd.date_range('19971002', periods=N, freq='B')
df = pd.DataFrame(np.random.randn(N, 1), index=dates, columns=list('A'))
df['Counter'] = df.groupby([df.index.year, df.index.month]).cumcount() + 1
print(df.loc['1998-09-25':'1998-10-05'])
yields
A Counter
1998-09-25 -0.511721 19
1998-09-28 1.912757 20
1998-09-29 -0.988309 21
1998-09-30 1.277888 22
1998-10-01 -0.579450 1
1998-10-02 -2.486014 2
1998-10-05 0.728789 3
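If you'd rather not build the group keys explicitly, an equivalent pattern (a sketch, not part of the answer above) marks each row where the month differs from the previous row and cumsums those marks into group labels:

```python
import pandas as pd
import numpy as np

N = 300
dates = pd.date_range('19971002', periods=N, freq='B')
df = pd.DataFrame(np.random.randn(N, 1), index=dates, columns=list('A'))

# True on the first row of each new month; cumsum turns those marks into
# consecutive group ids, so cumcount restarts whenever the month changes.
month = df.index.to_series().dt.month
group_id = month.ne(month.shift()).cumsum()
df['Counter'] = df.groupby(group_id.values).cumcount() + 1
```

This resets on any month change, so it also behaves correctly when the same month recurs in a later year.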
Related
So let's say I have a pandas DataFrame with SOME repeated dates:
import pandas as pd
import random
reportDate = pd.date_range('04-01-2010', '09-03-2021', periods=5000).date
lowPriceMin = [random.randint(10, 20) for x in range(5000)]
df = pd.DataFrame()
df['reportDate'] = reportDate
df['lowPriceMin'] = lowPriceMin
Now I want to get the min value from every week since the starting date. So I will have around 559 (the number of weeks from '04-01-2010' to '09-03-2021') values with the min value from every week.
Try with resample:
df['reportDate'] = pd.to_datetime(df['reportDate'])
>>> df.set_index("reportDate").resample("W").min()
lowPriceMin
reportDate
2010-01-10 10
2010-01-17 10
2010-01-24 14
2010-01-31 10
2010-02-07 14
...
2021-02-14 11
2021-02-21 11
2021-02-28 10
2021-03-07 10
2021-03-14 17
[584 rows x 1 columns]
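If you prefer groupby over resample, pd.Grouper with a weekly frequency gives the same result (a sketch with synthetic data, since the question's data is random):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'reportDate': pd.date_range('2010-01-04', '2021-09-03', periods=5000),
    'lowPriceMin': rng.integers(10, 21, size=5000),
})

# Group rows into week-ending-Sunday bins, then take the per-week minimum.
weekly_min = df.groupby(pd.Grouper(key='reportDate', freq='W'))['lowPriceMin'].min()
```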
I have a dataframe df:
0 2003-01-02
1 2015-10-31
2 2015-11-01
16 2015-11-02
33 2015-11-03
44 2015-11-04
and I want to trim the outliers in the dates. So in this example I want to delete the row with the date 2003-01-02. Or, in bigger data frames, I want to delete the dates that do not lie in the interval where 95% or 99% of the data lie. Is there a function that can do this?
You could use quantile() on a Series or DataFrame.
import datetime
import pandas as pd

dates = [datetime.date(2003, 1, 2),
         datetime.date(2015, 10, 31),
         datetime.date(2015, 11, 1),
         datetime.date(2015, 11, 2),
         datetime.date(2015, 11, 3),
         datetime.date(2015, 11, 4)]
df = pd.DataFrame({'DATE': [pd.Timestamp(x) for x in dates]})
print(df)
qa = df['DATE'].quantile(0.1)  # 10th percentile
qb = df['DATE'].quantile(0.9)  # 90th percentile
print(qa, qb)
#remove outliers
xf = df[(df['DATE'] >= qa) & (df['DATE'] <= qb)]
print(xf)
The output is:
DATE
0 2003-01-02
1 2015-10-31
2 2015-11-01
3 2015-11-02
4 2015-11-03
5 2015-11-04
2009-06-01 12:00:00 2015-11-03 12:00:00
DATE
1 2015-10-31
2 2015-11-01
3 2015-11-02
4 2015-11-03
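For the 95% interval mentioned in the question, the same pattern works with 0.025/0.975 quantiles (a sketch reusing the sample data above):

```python
import datetime

import pandas as pd

dates = [datetime.date(2003, 1, 2),
         datetime.date(2015, 10, 31),
         datetime.date(2015, 11, 1),
         datetime.date(2015, 11, 2),
         datetime.date(2015, 11, 3),
         datetime.date(2015, 11, 4)]
df = pd.DataFrame({'DATE': [pd.Timestamp(x) for x in dates]})

# Keep the central 95% of the dates: trim 2.5% from each tail.
lo = df['DATE'].quantile(0.025)
hi = df['DATE'].quantile(0.975)
trimmed = df[df['DATE'].between(lo, hi)]
```

With only six rows the interpolated quantiles still fall strictly inside the extremes, so both the 2003 outlier and the last date are dropped.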
Assuming you have your column converted to datetime format:
import pandas as pd
import datetime as dt
df = pd.DataFrame(data)  # 'data' is your source data from the question
df = pd.to_datetime(df[0])
you can do:
include = df[df.dt.year > 2003]
print(include)
[out]:
1 2015-10-31
2 2015-11-01
3 2015-11-02
4 2015-11-03
5 2015-11-04
Name: 0, dtype: datetime64[ns]
Have a look at the quantile approach in the other answer; regarding your question, it's basically the same idea (be creative, my friend):
s = pd.Series(df)
s10 = s.quantile(.10)
s90 = s.quantile(.90)
my_filtered_data = df[df.dt.year >= s10.year]
my_filtered_data = my_filtered_data[my_filtered_data.dt.year <= s90.year]
I'm starting from a dataframe that has a start date and an end date, for instance:
ID START END A
0 2014-04-09 2014-04-15 5
1 2018-06-05 2018-07-01 8
2 2018-06-05 2018-07-01 7
And I'm trying to find, for each week, how many elements were started but not ended at that point.
For instance, in the DF above:
Week-Monday N
2014-04-07 1
2014-04-14 1
2014-04-21 0
...
2018-06-04 2
...
Something like the below doesn't quite work, since it only resamples on end date:
df = df.resample("W-Mon", on="END").sum()
I don't know how to integrate both conditions: that the occurrences be after the start date, yet before the end date.
You can start from here:
import pandas as pd
df = pd.DataFrame({'ID':[0,1,2],
'START':['2014-04-09', '2018-06-05', '2018-06-05'],
'END':['2014-04-15', '2018-07-01', '2018-07-01'],
'A':[5,8,7]})
1- Find the week number for each START and each END, and find the Week-Monday.
import datetime, time
from datetime import timedelta
df.loc[:,'startWeek'] = df.START.apply(lambda x: datetime.datetime.strptime(x,'%Y-%m-%d').isocalendar()[1])
df.loc[:,'endWeek'] = df.END.apply(lambda x: datetime.datetime.strptime(x,'%Y-%m-%d').isocalendar()[1])
df.loc[:, 'Week-Monday'] = df.START.apply(lambda x: datetime.datetime.strptime(x,'%Y-%m-%d')- timedelta(days=datetime.datetime.strptime(x,'%Y-%m-%d').weekday()))
2- Check if they are the same, if yes, then ended during the same week.
def endedNotSameWeek(row):
if row['startWeek']!=row['endWeek']:
return 1
return 0
df.loc[:,'NotSameWeek'] = df.apply(endedNotSameWeek, axis=1)
print(df)
Output:
ID START END A startWeek endWeek Week-Monday NotSameWeek
0 0 2014-04-09 2014-04-15 5 15 16 2014-04-07 1
1 1 2018-06-05 2018-07-01 8 23 26 2018-06-04 1
2 2 2018-06-05 2018-07-01 7 23 26 2018-06-04 1
3- Group by each Week-Monday to get the number of cases that did not end during the same week.
df.groupby('Week-Monday')['NotSameWeek'].sum().reset_index(name='N')
Week-Monday N
0 2014-04-07 1
1 2018-06-04 2
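Note that comparing week numbers only tells you whether a row ended in the week it started. To count, for every Monday, how many rows were still open, one approach (a sketch, not the method above) is to expand each row to every Monday it spans and count occurrences:

```python
import pandas as pd

df = pd.DataFrame({'ID': [0, 1, 2],
                   'START': pd.to_datetime(['2014-04-09', '2018-06-05', '2018-06-05']),
                   'END': pd.to_datetime(['2014-04-15', '2018-07-01', '2018-07-01'])})

def mondays_covered(row):
    # Monday of the week containing START, through END, one entry per week.
    monday = row['START'] - pd.Timedelta(days=row['START'].weekday())
    return list(pd.date_range(monday, row['END'], freq='W-MON'))

# One row per (interval, Monday) pair, then count intervals open per Monday.
counts = (df.apply(mondays_covered, axis=1)
            .explode()
            .value_counts()
            .sort_index())
```

Weeks where nothing is open simply don't appear; reindex against a full weekly range (fill value 0) to get the zero rows shown in the question.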
I'm new to Python and I'm facing the following problem. I have a DataFrame composed of 2 columns, one of them a date (datetime64[ns]). I want to keep all records within the last 12 months. My code is the following:
today=start_time.date()
last_year = today + relativedelta(months = -12)
new_df = df[pd.to_datetime(df.mydate) >= last_year]
when I run it I get the following message:
TypeError: type object 2017-06-05
Any ideas?
last_year seems to bring me the date that I want in the following format: 2017-06-05
Create a Timedelta object in pandas to increment the date (12 months). Call pandas.Timestamp('now') to get the current date, and then create a date_range. Here is an example for getting monthly data for 12 months.
import pandas as pd
import datetime
list_1 = [i for i in range(0, 12)]
list_2 = [i for i in range(13, 25)]
list_3 = [i for i in range(26, 38)]
data_frame = pd.DataFrame({'A': list_1, 'B': list_2, 'C': list_3},
                          pd.date_range(pd.Timestamp('now'),
                                        pd.Timestamp('now') + pd.Timedelta(weeks=53),
                                        freq='M'))
We create a timestamp for the current date and enter that as our start date. Then we create a timedelta to increment that date by 53 weeks (or 52 if you'd like) which gets us 12 months of data. Below is the output:
A B C
2018-06-30 05:05:21.335625 0 13 26
2018-07-31 05:05:21.335625 1 14 27
2018-08-31 05:05:21.335625 2 15 28
2018-09-30 05:05:21.335625 3 16 29
2018-10-31 05:05:21.335625 4 17 30
2018-11-30 05:05:21.335625 5 18 31
2018-12-31 05:05:21.335625 6 19 32
2019-01-31 05:05:21.335625 7 20 33
2019-02-28 05:05:21.335625 8 21 34
2019-03-31 05:05:21.335625 9 22 35
2019-04-30 05:05:21.335625 10 23 36
2019-05-31 05:05:21.335625 11 24 37
Try
today = datetime.datetime.now()
You can use pandas functionality with datetime objects. The syntax is often more intuitive and obviates the need for additional imports.
last_year = pd.to_datetime('today') + pd.DateOffset(years=-1)
new_df = df[pd.to_datetime(df.mydate) >= last_year]
As such, we would need to see all your code to be sure of the reason behind your error; for example, how is start_time defined?
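A minimal end-to-end sketch of that filter, assuming a DataFrame with a mydate column and using a fixed "today" so the result is reproducible (swap in pd.to_datetime('today') for real use):

```python
import pandas as pd

# Hypothetical data: one row per month-start over four years.
df = pd.DataFrame({'mydate': pd.date_range('2015-01-01', periods=48, freq='MS')})

today = pd.Timestamp('2018-12-31')           # stand-in for pd.to_datetime('today')
last_year = today + pd.DateOffset(years=-1)  # 2017-12-31
new_df = df[pd.to_datetime(df.mydate) >= last_year]
```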
import pandas as pd
import numpy as np
df = pd.DataFrame({'year': np.repeat(2018,12), 'month': range(1,13)})
In this data frame, I am interested in creating a field called 'year_month' such that each value looks like so:
datetime.date(df['year'][0], df['month'][0], 1).strftime("%Y%m")
I'm stuck on how to apply this operation to the entire data frame and would appreciate any help.
Join both columns converted to strings, using zfill to zero-pad the month:
df['new'] = df['year'].astype(str) + df['month'].astype(str).str.zfill(2)
Or add new column day by assign, convert columns to_datetime and last strftime:
df['new'] = pd.to_datetime(df.assign(day=1)).dt.strftime("%Y%m")
If multiple columns in DataFrame:
df['new'] = pd.to_datetime(df.assign(day=1)[['day','month','year']]).dt.strftime("%Y%m")
print (df)
month year new
0 1 2018 201801
1 2 2018 201802
2 3 2018 201803
3 4 2018 201804
4 5 2018 201805
5 6 2018 201806
6 7 2018 201807
7 8 2018 201808
8 9 2018 201809
9 10 2018 201810
10 11 2018 201811
11 12 2018 201812
Timings:
df = pd.DataFrame({'year': np.repeat(2018,12), 'month': range(1,13)})
df = pd.concat([df] * 1000, ignore_index=True)
In [212]: %timeit pd.to_datetime(df.assign(day=1)).dt.strftime("%Y%m")
10 loops, best of 3: 74.1 ms per loop
In [213]: %timeit df['year'].astype(str) + df['month'].astype(str).str.zfill(2)
10 loops, best of 3: 41.3 ms per loop
One way would be to create the datetime objects directly from the source data:
import pandas as pd
import numpy as np
from datetime import date
df = pd.DataFrame({'date': [date(i, j, 1) for i, j
                            in zip(np.repeat(2018, 12), range(1, 13))]})
# date
# 0 2018-01-01
# 1 2018-02-01
# 2 2018-03-01
# 3 2018-04-01
# 4 2018-05-01
# 5 2018-06-01
# 6 2018-07-01
# 7 2018-08-01
# 8 2018-09-01
# 9 2018-10-01
# 10 2018-11-01
# 11 2018-12-01
You could use an apply function, accessing the columns by name (positional indexing depends on column order) and importing datetime first:
import datetime
df['year_month'] = df.apply(lambda row: datetime.date(row['year'], row['month'], 1).strftime("%Y%m"), axis=1)