Applying Date Operation to Entire Data Frame - python

import pandas as pd
import numpy as np
df = pd.DataFrame({'year': np.repeat(2018,12), 'month': range(1,13)})
In this data frame, I am interested in creating a field called 'year_month' such that each value looks like so:
datetime.date(df['year'][0], df['month'][0], 1).strftime("%Y%m")
I'm stuck on how to apply this operation to the entire data frame and would appreciate any help.

Join both columns converted to strings, zero-padding the months with zfill:
df['new'] = df['year'].astype(str) + df['month'].astype(str).str.zfill(2)
Or add a new day column with assign, convert the columns with to_datetime, and finally apply strftime:
df['new'] = pd.to_datetime(df.assign(day=1)).dt.strftime("%Y%m")
If the DataFrame has multiple other columns, select only the relevant ones:
df['new'] = pd.to_datetime(df.assign(day=1)[['day','month','year']]).dt.strftime("%Y%m")
print(df)
    month  year     new
0       1  2018  201801
1       2  2018  201802
2       3  2018  201803
3       4  2018  201804
4       5  2018  201805
5       6  2018  201806
6       7  2018  201807
7       8  2018  201808
8       9  2018  201809
9      10  2018  201810
10     11  2018  201811
11     12  2018  201812
Timings:
df = pd.DataFrame({'year': np.repeat(2018,12), 'month': range(1,13)})
df = pd.concat([df] * 1000, ignore_index=True)
In [212]: %timeit pd.to_datetime(df.assign(day=1)).dt.strftime("%Y%m")
10 loops, best of 3: 74.1 ms per loop
In [213]: %timeit df['year'].astype(str) + df['month'].astype(str).str.zfill(2)
10 loops, best of 3: 41.3 ms per loop

One way would be to create the datetime objects directly from the source data:
import pandas as pd
import numpy as np
from datetime import date
df = pd.DataFrame({'date': [date(i, j, 1) for i, j
                            in zip(np.repeat(2018, 12), range(1, 13))]})
#          date
# 0  2018-01-01
# 1  2018-02-01
# 2  2018-03-01
# 3  2018-04-01
# 4  2018-05-01
# 5  2018-06-01
# 6  2018-07-01
# 7  2018-08-01
# 8  2018-09-01
# 9  2018-10-01
# 10 2018-11-01
# 11 2018-12-01

You could use an apply function such as (using column names rather than positional indices, since positional access breaks if the column order changes; note this needs import datetime):
import datetime
df['year_month'] = df.apply(lambda row: datetime.date(row['year'], row['month'], 1).strftime("%Y%m"), axis=1)

Related

Get the min value of a week in a pandas dataframe

So let's say I have a pandas dataframe with SOME repeated dates:
import pandas as pd
import random
reportDate = pd.date_range('04-01-2010', '09-03-2021',periods = 5000).date
lowPriceMin = [random.randint(10, 20) for x in range(5000)]
df = pd.DataFrame()
df['reportDate'] = reportDate
df['lowPriceMin'] = lowPriceMin
Now I want to get the min value from every week since the starting date. So I will have around 559 (the number of weeks from '04-01-2010' to '09-03-2021') values with the min value from every week.
Try with resample:
df['reportDate'] = pd.to_datetime(df['reportDate'])
>>> df.set_index("reportDate").resample("W").min()
            lowPriceMin
reportDate
2010-01-10           10
2010-01-17           10
2010-01-24           14
2010-01-31           10
2010-02-07           14
...                 ...
2021-02-14           11
2021-02-21           11
2021-02-28           10
2021-03-07           10
2021-03-14           17

[584 rows x 1 columns]
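A side note not in the original answer: "W" is an alias for "W-SUN", so each weekly bin is labelled by its Sunday. If you want weeks anchored to a different weekday, pass an explicit anchor, e.g.:
# "W" defaults to "W-SUN"; "W-MON" makes each week end (and be labelled) on Monday
weekly_min = df.set_index("reportDate").resample("W-MON").min()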

add rows to pandas dataframe based on days in week

So I'm fairly new to pandas and I've run into a problem that I'm not able to fix.
I have the following dataframe:
import pandas as pd
df = pd.DataFrame({
    'Day': ['2018-12-31', '2019-01-07'],
    'Product_Finished': [1000, 2000],
    'Product_Tested': [50, 10]})
df['Day'] = pd.to_datetime(df['Day'], format='%Y-%m-%d')
df
I would like to add rows to my dataframe based on the column 'Day', ideally adding all the other days of the week while keeping the rest of the columns at the same value. The output should look something like this:
          Day  Product_Finished  Product_Tested
0  2018-12-31              1000              50
1  2019-01-01              1000              50
2  2019-01-02              1000              50
3  2019-01-03              1000              50
4  2019-01-04              1000              50
5  2019-01-05              1000              50
6  2019-01-06              1000              50
7  2019-01-07              2000              10
8  2019-01-08              2000              10
9  2019-01-09              2000              10
10 2019-01-10              2000              10
11 2019-01-11              2000              10
12 2019-01-12              2000              10
13 2019-01-13              2000              10
Any tips would be greatly appreciated, thank you in advance!
You can achieve this by first creating a new DataFrame that contains the desired date range, using pandas.date_range. Then, as a second step, use pandas.merge_asof to pull the last known value onto each day.
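A minimal sketch of those two steps (the all_days name and the six-day extension past the last date are my assumptions, chosen to match the desired output above):
import pandas as pd

df = pd.DataFrame({
    'Day': pd.to_datetime(['2018-12-31', '2019-01-07']),
    'Product_Finished': [1000, 2000],
    'Product_Tested': [50, 10]})

# Step 1: build the full daily range, extended 6 days past the last date
all_days = pd.DataFrame({
    'Day': pd.date_range(df['Day'].min(),
                         df['Day'].max() + pd.Timedelta(days=6))})

# Step 2: for each day, take the last row at or before that date
out = pd.merge_asof(all_days, df, on='Day', direction='backward')
print(out)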
You can also resample by reindexing against a daily date range and forward-filling:
import datetime
import pandas as pd
df = pd.DataFrame({
    'Day': ['2018-12-31', '2019-01-07'],
    'Product_Finished': [1000, 2000],
    'Product_Tested': [50, 10]})
df['Day'] = pd.to_datetime(df['Day'], format='%Y-%m-%d')
df.set_index('Day',inplace=True)
df_Date=pd.date_range(start=df.index.min(), end=(df.index.max()+ datetime.timedelta(days=6)), freq='D')
df=df.reindex(df_Date,method='ffill',fill_value=None)
df.reset_index(inplace=True)
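One caveat worth adding (my note, not part of the original answer): the reindexed DatetimeIndex has no name, so after reset_index the restored date column is named index. You may want to rename it back:
# the date column comes back as 'index' after reset_index(); restore its name
df = df.rename(columns={'index': 'Day'})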

How to trim outliers in dates in python?

I have a dataframe df:
0 2003-01-02
1 2015-10-31
2 2015-11-01
16 2015-11-02
33 2015-11-03
44 2015-11-04
and I want to trim the outliers in the dates. So in this example I want to delete the row with the date 2003-01-02. Or, in bigger data frames, I want to delete the dates that do not lie in the interval where 95% or 99% of them lie. Is there a function that can do this?
You could use quantile() on Series or DataFrame.
import datetime
import pandas as pd

dates = [datetime.date(2003, 1, 2),
         datetime.date(2015, 10, 31),
         datetime.date(2015, 11, 1),
         datetime.date(2015, 11, 2),
         datetime.date(2015, 11, 3),
         datetime.date(2015, 11, 4)]
df = pd.DataFrame({'DATE': [pd.Timestamp(x) for x in dates]})
print(df)

qa = df['DATE'].quantile(0.1)  # 10th percentile
qb = df['DATE'].quantile(0.9)  # 90th percentile
print(qa, qb)

# remove outliers outside the 10th-90th percentile band
xf = df[(df['DATE'] >= qa) & (df['DATE'] <= qb)]
print(xf)
The output is:
        DATE
0 2003-01-02
1 2015-10-31
2 2015-11-01
3 2015-11-02
4 2015-11-03
5 2015-11-04
2009-06-01 12:00:00 2015-11-03 12:00:00
        DATE
1 2015-10-31
2 2015-11-01
3 2015-11-02
4 2015-11-03
Assuming you have your column converted to datetime format:
import pandas as pd
import datetime as dt
df = pd.DataFrame(data)   # `data` holds your raw dates
df = pd.to_datetime(df[0])
you can do:
include = df[df.dt.year > 2003]
print(include)
[out]:
1 2015-10-31
2 2015-11-01
3 2015-11-02
4 2015-11-03
5 2015-11-04
Name: 0, dtype: datetime64[ns]
Have a look here
... regarding your answer (it's basically the same idea... be creative, my friend):
s = pd.Series(df)
s10 = s.quantile(.10)
s90 = s.quantile(.90)
my_filtered_data = df[df.dt.year >= s10.year]
my_filtered_data = my_filtered_data[my_filtered_data.dt.year <= s90.year]
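If you want day-level precision rather than comparing whole years, the same quantiles can be applied to the full timestamps (a small variation on the snippet above, assuming df is the datetime Series from the previous answer):
# compare full timestamps instead of only the year component
my_filtered_data = df[(df >= s10) & (df <= s90)]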

pandas time series average monthly volume

I have csv time series data with one row per day: a date and a cumulative sale. Similar to this:
01-01-2010 12:10:10 50.00
01-02-2010 12:10:10 80.00
01-03-2010 12:10:10 110.00
.
. for each day of 2010
.
01-01-2011 12:10:10 2311.00
01-02-2011 12:10:10 2345.00
01-03-2011 12:10:10 2445.00
.
. for each day of 2011
.
and so on.
I am looking to get the monthly sale (max - min) for each month in each year. So for the past 5 years I will have 5 Jan values (max - min), 5 Feb values (max - min), and so on.
Once I have those, I next get the 5-year average for Jan, the 5-year average for Feb, and so on.
Right now, I do this by slicing the original df by [year/month] and then averaging over the specific month of the year.
I am looking to use the time series resample() approach, but I am currently stuck on telling pandas to take the monthly (max - min) for each month in the past 10 years from today, and then chain in a .mean().
Any advice on an efficient way to do this with resample() would be appreciated.
It would probably look something like this (note: no cumulative sale values here). The key is to perform a df.groupby() passing dt.year and dt.month.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'date': pd.date_range(start='2016-01-01', end='2017-12-31'),
    'sale': np.random.randint(100, 200, size=365*2+1)
})
# Get month max, min and size (and as they are sorted - last and first)
dfg = df.groupby([df.date.dt.year,df.date.dt.month])['sale'].agg(['last','first','size'])
# Assign new cols (diff and avg) and drop max min size
dfg = dfg.assign(diff = dfg['last'] - dfg['first'])
dfg = dfg.assign(avg = dfg['diff'] / dfg['size']).drop(['last','first','size'], axis=1)
# Rename index cols
dfg.index = dfg.index.rename(['Year','Month'])
print(dfg.head(6))
Returns:
            diff       avg
Year Month
2016 1       -56 -1.806452
     2       -17 -0.586207
     3        30  0.967742
     4        34  1.133333
     5        46  1.483871
     6         2  0.066667
You can do it with two resamples:
First resample to months (M) and take the diff (max() - min()).
Then resample to 5 years (5AS), group by month, and take the mean().
E.g.:
In []:
date_range = pd.date_range(start='2008-01-01', end='2017-12-31')
df = pd.DataFrame({'sale': np.random.randint(100, 200, size=date_range.size)},
                  index=date_range)
In []:
df1 = df.resample('M').apply(lambda g: g.max()-g.min())
df1.resample('5AS').apply(lambda g: g.groupby(g.index.month).mean()).unstack()
Out[]:
            sale
               1     2     3     4     5     6     7     8     9    10    11    12
2008-01-01  95.4  90.2  95.2  95.4  93.2  93.8  91.8  95.6  93.4  93.4  94.2  93.8
2013-01-01  93.2  96.4  92.8  96.4  92.6  93.0  93.2  92.6  91.2  93.2  91.8  92.2

Start increment column at beginning of month

Thank you in advance for your assistance.
I want to set 'Counter' to 1 whenever there is a change in month, increment it by 1 until the month changes again, and repeat. Like so:
                   A  Month  Counter
2015-10-30 -1.478066     10       21
2015-10-31 -1.562437     10       22
2015-11-01 -0.292285     11        1
2015-11-02 -1.581140     11        2
2015-11-03  0.603113     11        3
2015-11-04 -0.543563     11        4
In [1]: import pandas as pd
   ...: import numpy as np
In [2]: dates = pd.date_range('20151030', periods=6)
In [3]: df = pd.DataFrame(np.random.randn(6,1), index=dates, columns=list('A'))
In [4]: df
Out[4]:
                   A
2015-10-30 -1.478066
2015-10-31 -1.562437
2015-11-01 -0.292285
2015-11-02 -1.581140
2015-11-03  0.603113
2015-11-04 -0.543563
Tried this, but it adds 1 to the actual month integer:
In [5]: df['Month'] = df.index.month
In [6]: df['Counter'] = np.where(df['Month'] != df['Month'], (1), (df['Month'].shift()+1))
In [7]: df
Out[7]:
                   A  Month  Counter
2015-10-30 -1.478066     10      NaN
2015-10-31 -1.562437     10       11
2015-11-01 -0.292285     11       11
2015-11-02 -1.581140     11       12
2015-11-03  0.603113     11       12
2015-11-04 -0.543563     11       12
Tried datetime, getting closer:
In[8]: from datetime import timedelta
In[9]: df['Counter'] = df.index + timedelta(days=1)
Out[9]:
                   A  Month    Counter
2015-10-30 -1.478066     10 2015-10-31
2015-10-31 -1.562437     10 2015-11-01
2015-11-01 -0.292285     11 2015-11-02
2015-11-02 -1.581140     11 2015-11-03
2015-11-03  0.603113     11 2015-11-04
2015-11-04 -0.543563     11 2015-11-05
The latter gives me the date, but not my counter. New to python, so any help is appreciated. Thank you!
Edit, extending df to periods=300 to include over 12 months of data:
In[10]: dates = pd.date_range('19971002',periods=300)
In[11]: df=pd.DataFrame(np.random.randn(300,1),index=dates,columns=list('A'))
In[12]: df['Counter'] = df.groupby(df.index.month).cumcount()+1
In[13]: df.head()
Out[13]:
                   A  Counter
1997-09-29 -0.875468       20
1997-09-30  1.498145       21
1997-10-02  0.141262        1
1997-10-03  0.581974        2
1997-10-04  0.581974        3
In[14]: df[250:]
Out[14]:
                   A  Counter
1998-09-29 -0.875468       20
1998-09-30  1.498145       21
1998-10-01  0.141262       24
1998-10-02  0.581974       25
Desired results:
Out[13]:
                   A  Counter
1997-09-29 -0.875468       20
1997-09-30  1.498145       21
1997-10-02  0.141262        1
1997-10-03  0.581974        2
1997-10-04  0.581974        3
The code works fine (Out[13] above), but it seems that once the data goes beyond 12 months the counter keeps incrementing by 1 instead of resetting to 1 (Out[14] above). Also, getting tricky here: the random date generator includes weekends, while my data only has weekday data. Hope that helps me help you to help me better. Thank you!
You could use groupby/cumcount to assign a cumulative count to each group:
import pandas as pd
import numpy as np
N = 300
dates = pd.date_range('19971002', periods=N, freq='B')
df = pd.DataFrame(np.random.randn(N, 1),index=dates,columns=list('A'))
df['Counter'] = df.groupby([df.index.year, df.index.month]).cumcount()+1
print(df.loc['1998-09-25':'1998-10-05'])
yields
                   A  Counter
1998-09-25 -0.511721       19
1998-09-28  1.912757       20
1998-09-29 -0.988309       21
1998-09-30  1.277888       22
1998-10-01 -0.579450        1
1998-10-02 -2.486014        2
1998-10-05  0.728789        3
