For some reason doing df.resample("M").apply(foo) drops the index name in df. Is this expected behavior?
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(60)}, index=pd.date_range(start="2018-01-01", periods=60))
df.index.name = "dte"
df.head()
#              a
# dte
# 2018-01-01   0
# 2018-01-02   1
# 2018-01-03   2
# 2018-01-04   3
# 2018-01-05   4
def f(x):
    print(x.head())
df.resample("M").apply(f)
# 2018-01-01    0
# 2018-01-02    1
# 2018-01-03    2
# 2018-01-04    3
# 2018-01-05    4
# Name: a, dtype: int64
Update/clarification: when I said it drops the name, I meant that the Series received by the function doesn't have a name associated with its index.
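A minimal way to see this, reusing df from above (show_name is just an illustrative helper; the None output reflects the behavior reported here):
def show_name(x):
    # x is one column of each resampled group, passed as a Series;
    # on affected pandas versions its index name arrives as None, not "dte"
    print(x.index.name)

df.resample("M").apply(show_name)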
I suggest using an alternative to resample: groupby with pd.Grouper, which preserves the index name:
def f(x):
    print(x.head())
df.groupby(pd.Grouper(freq="M")).apply(f)
             a
dte
2018-01-01   0
2018-01-02   1
2018-01-03   2
2018-01-04   3
2018-01-05   4
             a
dte
2018-01-01   0
2018-01-02   1
2018-01-03   2
2018-01-04   3
2018-01-05   4
             a
dte
2018-02-01  31
2018-02-02  32
2018-02-03  33
2018-02-04  34
2018-02-05  35
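(The first group appears twice above because groupby.apply may call the function an extra time on the first group while deciding how to combine results.) As a quick check that the index name survives inside each group, a small sketch reusing df from above:
# Returns one value per monthly group; each group's index keeps its name
names = df.groupby(pd.Grouper(freq="M")).apply(lambda x: x.index.name)
print(names.tolist())   # ['dte', 'dte', 'dte'] for the three months covered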
Related
I have the following Pandas dataframe and I want to drop the rows for each customer where the difference between consecutive kept dates is less than 6 months. For example, for the customer with ID 1 I want to keep only the dates 2017-07-01, 2018-01-01 and 2018-08-01.
Customer_ID Date
1 2017-07-01
1 2017-08-01
1 2017-09-01
1 2017-10-01
1 2017-11-01
1 2017-12-01
1 2018-01-01
1 2018-02-01
1 2018-03-01
1 2018-04-01
1 2018-06-01
1 2018-08-01
2 2018-11-01
2 2019-02-01
2 2019-03-01
2 2019-05-01
2 2020-02-01
2 2020-05-01
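For reference, a reproducible version of this sample frame (a sketch; it assumes the Date column should be parsed to datetime):
import pandas as pd

df = pd.DataFrame({
    'Customer_ID': [1] * 12 + [2] * 6,
    'Date': pd.to_datetime([
        '2017-07-01', '2017-08-01', '2017-09-01', '2017-10-01',
        '2017-11-01', '2017-12-01', '2018-01-01', '2018-02-01',
        '2018-03-01', '2018-04-01', '2018-06-01', '2018-08-01',
        '2018-11-01', '2019-02-01', '2019-03-01', '2019-05-01',
        '2020-02-01', '2020-05-01',
    ]),
})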
Define the following function to process each group of rows (for each customer):
def selDates(grp):
    res = []
    while grp.size > 0:
        stRow = grp.iloc[0]   # keep the earliest remaining row
        res.append(stRow)
        # drop every row less than 6 months after the kept one
        grp = grp[grp.Date >= stRow.Date + pd.DateOffset(months=6)]
    return pd.DataFrame(res)
Then apply this function to each group:
result = df.groupby('Customer_ID', group_keys=False).apply(selDates)
The result, for your data sample, is:
Customer_ID Date
0 1 2017-07-01
6 1 2018-01-01
11 1 2018-08-01
12 2 2018-11-01
15 2 2019-05-01
16 2 2020-02-01
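Note that group_keys=False keeps the rows' original integer labels in the result instead of prepending Customer_ID as an extra index level, which is why the index above (0, 6, 11, ...) comes straight from the input frame.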
I want to resample the following dataframe from weekly to daily and then ffill the missing values.
Note: 2018-01-07 and 2018-01-14 are Sundays.
Date Val
0 2018-01-07 1
1 2018-01-14 2
I tried:
df.Date = pd.to_datetime(df.Date)
df.set_index('Date', inplace=True)
offset = pd.offsets.DateOffset(-6)
df.resample('D', loffset=offset).ffill()
Val
Date
2018-01-01 1
2018-01-02 1
2018-01-03 1
2018-01-04 1
2018-01-05 1
2018-01-06 1
2018-01-07 1
2018-01-08 2
But I want
Date Val
0 2018-01-01 1
1 2018-01-02 1
2 2018-01-03 1
3 2018-01-04 1
4 2018-01-05 1
5 2018-01-06 1
6 2018-01-07 1
7 2018-01-08 2
8 2018-01-09 2
9 2018-01-10 2
10 2018-01-11 2
11 2018-01-12 2
12 2018-01-13 2
13 2018-01-14 2
What did I do wrong?
Nothing, strictly speaking: resample only creates rows between the first and last values of the index, so no daily rows are generated beyond the (shifted) last observation. You can add a new last row manually, subtracting the offset from the last datetime:
# extend the frame by one shifted week so ffill covers the final week
df.loc[df.index[-1] - offset] = df.iloc[-1]
df = df.resample('D', loffset=offset).ffill()
print (df)
Val
Date
2018-01-01 1
2018-01-02 1
2018-01-03 1
2018-01-04 1
2018-01-05 1
2018-01-06 1
2018-01-07 1
2018-01-08 2
2018-01-09 2
2018-01-10 2
2018-01-11 2
2018-01-12 2
2018-01-13 2
2018-01-14 2
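Note: loffset was deprecated in pandas 1.1 and removed in 2.0. On recent versions you can shift the index yourself after resampling; a sketch, starting again from the original two-row weekly df:
# Equivalent without loffset: extend by one shifted week, resample, shift labels back
shift = pd.DateOffset(6)
df.loc[df.index[-1] + shift] = df.iloc[-1]
out = df.resample('D').ffill()
out.index = out.index - shift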
I have a dataframe df, which can be created with this:
import pandas as pd
import datetime
#create the dates to make into columns
datestart=datetime.date(2018,1,1)
dateend=datetime.date(2018,1,5)
newcols=pd.date_range(datestart,dateend).date
#create the test data
d={'name':['a','b','c','d'],'earlydate': [datetime.date(2018,1,1),datetime.date(2018,1,3),datetime.date(2018,1,4),datetime.date(2018,1,5)]}
#create initial test dataframe
df=pd.DataFrame(data=d)
#create the new dataframe with empty newcols
df=pd.concat([df,pd.DataFrame(columns=newcols)])
and looks like this:
df
Out[17]:
name earlydate 2018-01-01 ... 2018-01-03 2018-01-04 2018-01-05
0 a 2018-01-01 NaN ... NaN NaN NaN
1 b 2018-01-03 NaN ... NaN NaN NaN
2 c 2018-01-04 NaN ... NaN NaN NaN
3 d 2018-01-05 NaN ... NaN NaN NaN
[4 rows x 7 columns]
What I am looking to do is fill each of the empty new columns with the difference in days between the column name (which is a date) and earlydate (also a date). I want to do this DataFrame-wise: no function, lambda, apply, or for loop. I am fairly certain this should be possible frame-wise rather than column- or row-wise.
The result/expected ending df can be created with this:
dresultdata={'name':['a','b','c','d'],
'earlydate': [datetime.date(2018,1,1),datetime.date(2018,1,3),datetime.date(2018,1,4),datetime.date(2018,1,5)],
datetime.date(2018,1,1):[0,-2,-3,-4], #this is the difference in days between the column name and the earlydate
datetime.date(2018,1,2):[-1,1,2,3],
datetime.date(2018,1,3):[-2,0,1,2],
datetime.date(2018,1,4):[-3,-1,0,1]}
dferesult=pd.DataFrame(data=dresultdata)
And looks like this:
dferesult
Out[19]:
name earlydate 2018-01-01 2018-01-02 2018-01-03 2018-01-04
0 a 2018-01-01 0 -1 -2 -3
1 b 2018-01-03 -2 1 0 -1
2 c 2018-01-04 -3 2 1 0
3 d 2018-01-05 -4 3 2 1
I have made this work by looping as follows:
for d in newcols:
df.loc[:,d]=d-df.earlydate
But it takes forever for large frames (1m rows). Ideas welcome!
IIUC:
# row dates and column dates as NumPy datetime64 arrays
i = pd.to_datetime(df.earlydate.values).values
j = pd.to_datetime(df.columns[2:]).values
# broadcast the subtraction into a (rows x cols) grid, then convert to whole days
df.iloc[:, 2:] = (j - i[:, None]).astype('timedelta64[D]').astype(int)
df
earlydate name 2018-01-01 2018-01-02 2018-01-03 2018-01-04 2018-01-05
0 2018-01-01 a 0 1 2 3 4
1 2018-01-03 b -2 -1 0 1 2
2 2018-01-04 c -3 -2 -1 0 1
3 2018-01-05 d -4 -3 -2 -1 0
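The key step is i[:, None], which turns the array of earlydate values into a column vector so that j - i[:, None] broadcasts into a full (rows × columns) grid of timedelta64 differences in one vectorized operation; .astype('timedelta64[D]').astype(int) then converts them to whole days, avoiding the per-column loop entirely.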
There are a lot of stations in the csv file and I don't know how to use a loop to count the number of NaNs for every station. This is what I have so far, counting one station at a time. Can someone help me please? Thank you in advance.
station1= train_df[train_df['station'] == 28079004]
station1 = station1[['date', 'O_3']]
count_nan = len(station1) - station1.count()
print(count_nan)
I think you need to create an index from the station column with set_index, select the columns to check for missing values, and finally count them with sum:
train_df = pd.DataFrame({'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'date':pd.date_range('2015-01-01', periods=6),
'O_3':[np.nan,3,np.nan,9,2,np.nan],
'station':[28079004] * 2 + [28079005] * 4})
print (train_df)
B C date O_3 station
0 4 7 2015-01-01 NaN 28079004
1 5 8 2015-01-02 3.0 28079004
2 4 9 2015-01-03 NaN 28079005
3 5 4 2015-01-04 9.0 28079005
4 5 2 2015-01-05 2.0 28079005
5 4 3 2015-01-06 NaN 28079005
df = train_df.set_index('station')[['date', 'O_3']].isnull().sum(level=0).astype(int)
print (df)
date O_3
station
28079004 0 1
28079005 0 2
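Note: the level argument of sum was deprecated in pandas 1.3 and removed in 2.0; on recent versions the same per-station count can be written with groupby on the index level (same train_df as above):
df = (train_df.set_index('station')[['date', 'O_3']]
              .isnull()
              .groupby(level=0)   # group by the station index level
              .sum()
              .astype(int))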
Another solution:
df = train_df[['date', 'O_3']].isnull().groupby(train_df['station']).sum().astype(int)
print (df)
date O_3
station
28079004 0 1
28079005 0 2
Although jez already answered, and that answer is probably better here, this is what a groupby solution would look like:
import pandas as pd
import numpy as np
np.random.seed(444)
n = 10
train_df = pd.DataFrame({
'station': np.random.choice(np.arange(28079004,28079008), size=n),
'date': pd.date_range('2018-01-01', periods=n),
'O_3': np.random.choice([np.nan,1], size=n)
})
print(train_df)
s = train_df.groupby('station')['O_3'].apply(lambda x: x.isna().sum())
print(s)
prints:
station date O_3
0 28079007 2018-01-01 NaN
1 28079004 2018-01-02 1.0
2 28079007 2018-01-03 NaN
3 28079004 2018-01-04 NaN
4 28079007 2018-01-05 NaN
5 28079004 2018-01-06 1.0
6 28079007 2018-01-07 NaN
7 28079004 2018-01-08 NaN
8 28079006 2018-01-09 NaN
9 28079007 2018-01-10 1.0
And the output (s):
station
28079004 2
28079006 1
28079007 4
I am trying to sum the values of colA over a date range based on the date column, and store this rolling value in the new column sum_col.
But I am getting the sum of all rows (=100), not just those in the date range.
I can't use rolling or groupby, as my dates (in the real data) are not sequential (some days are missing).
Any idea how to do this? Thanks.
# Create data frame
df = pd.DataFrame()
# Create datetimes and data
df['date'] = pd.date_range('1/1/2018', periods=100, freq='D')
df['colA']= 1
df['colB']= 2
df['colC']= 3
StartDate = df.date- pd.to_timedelta(5, unit='D')
EndDate= df.date
dfx=df
dfx['StartDate'] = StartDate
dfx['EndDate'] = EndDate
dfx['sum_col']=df[(df['date'] > StartDate) & (df['date'] <= EndDate)].sum()['colA']
dfx.head(50)
I'm not sure whether you want 3 columns for the sum of colA, colB, colC respectively, or one column which sums all three, but here is an example of how you would sum the values for colA:
dfx['colAsum'] = dfx.apply(lambda x: df.loc[(df.date >= x.StartDate) &
(df.date <= x.EndDate), 'colA'].sum(), axis=1)
e.g. (with periods=10):
date colA colB colC StartDate EndDate colAsum
0 2018-01-01 1 2 3 2017-12-27 2018-01-01 1
1 2018-01-02 1 2 3 2017-12-28 2018-01-02 2
2 2018-01-03 1 2 3 2017-12-29 2018-01-03 3
3 2018-01-04 1 2 3 2017-12-30 2018-01-04 4
4 2018-01-05 1 2 3 2017-12-31 2018-01-05 5
5 2018-01-06 1 2 3 2018-01-01 2018-01-06 6
6 2018-01-07 1 2 3 2018-01-02 2018-01-07 6
7 2018-01-08 1 2 3 2018-01-03 2018-01-08 6
8 2018-01-09 1 2 3 2018-01-04 2018-01-09 6
9 2018-01-10 1 2 3 2018-01-05 2018-01-10 6
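As an aside, the row-wise apply re-scans the whole frame once per row, which gets slow on large data. Contrary to the question's assumption, an offset-based rolling window does handle non-sequential dates: a '6D' window covers the six calendar days [date - 5 days, date], reproducing the inclusive sum above. A sketch, assuming at most one row per day:
# Offset-based rolling windows work on a datetime index even with missing days
dfx['colAsum'] = df.set_index('date')['colA'].rolling('6D').sum().to_numpy()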
If I understand correctly:
for i in range(df.shape[0]):
    dfx.loc[i, 'sum_col'] = df[(df['date'] > StartDate[i]) & (df['date'] <= EndDate[i])].sum()['colA']
For example, for the range (2018-01-01, 2018-01-06] the sum is 5, since the start date is excluded.
date colA colB colC StartDate EndDate sum_col
0 2018-01-01 1 2 3 2017-12-27 2018-01-01 1.0
1 2018-01-02 1 2 3 2017-12-28 2018-01-02 2.0
2 2018-01-03 1 2 3 2017-12-29 2018-01-03 3.0
3 2018-01-04 1 2 3 2017-12-30 2018-01-04 4.0
4 2018-01-05 1 2 3 2017-12-31 2018-01-05 5.0
5 2018-01-06 1 2 3 2018-01-01 2018-01-06 5.0
6 2018-01-07 1 2 3 2018-01-02 2018-01-07 5.0
7 2018-01-08 1 2 3 2018-01-03 2018-01-08 5.0
8 2018-01-09 1 2 3 2018-01-04 2018-01-09 5.0
9 2018-01-10 1 2 3 2018-01-05 2018-01-10 5.0
10 2018-01-11 1 2 3 2018-01-06 2018-01-11 5.0
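Note the boundary difference between the two answers: the first filters with date >= StartDate (start inclusive, so up to 6 rows are summed), while this loop uses date > StartDate (start exclusive, up to 5), which is why the sums plateau at 6 and 5 respectively.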