Generating the data
import numpy as np
import pandas as pd

np.random.seed(42)
date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='H')
df = pd.DataFrame(np.random.randint(0, 10, size=(len(date_rng), 3)),
                  columns=['data1', 'data2', 'data3'],
                  index=date_rng)
daily_mean_df = pd.DataFrame(np.zeros([len(date_rng), 3]),
                             columns=['data1', 'data2', 'data3'],
                             index=date_rng)
mask = np.random.choice([1, 0], df.shape, p=[.35, .65]).astype(bool)
df[mask] = np.nan
df
>>>
data1 data2 data3
2018-01-01 00:00:00 1.0 3.0 NaN
2018-01-01 01:00:00 8.0 5.0 8.0
2018-01-01 02:00:00 5.0 NaN 6.0
2018-01-01 03:00:00 4.0 7.0 4.0
2018-01-01 04:00:00 NaN 8.0 NaN
... ... ... ...
2018-01-07 20:00:00 8.0 7.0 NaN
2018-01-07 21:00:00 5.0 4.0 5.0
2018-01-07 22:00:00 NaN 6.0 NaN
2018-01-07 23:00:00 2.0 4.0 3.0
2018-01-08 00:00:00 NaN NaN NaN
I want to select a specific time each day, then set all values in that day equal to the data at that time.
For example, if I select 01:00:00, then all data for 2018-01-01 should equal the row at 2018-01-01 01:00:00, all data for 2018-01-02 should equal the row at 2018-01-02 01:00:00, and so on.
I know how to select the data at that time:
timestamp = "01:00:00"
df[df.index.strftime("%H:%M:%S") == timestamp]
but I don't know how to set the rest of the day's data equal to it.
Thank you for reading.
Check with reindex:
s = df[df.index.strftime("%H:%M:%S") == timestamp]  # the 01:00:00 row of each day
s.index = s.index.date                               # re-key those rows by calendar date
df[:] = s.reindex(df.index.date).values              # repeat each day's row for every hour
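For reference, pandas also has a built-in selector for this; a minimal sketch of the same idea using DataFrame.at_time (equivalent here to the strftime comparison, assuming the hourly index from the question):
s = df.at_time("01:00")                  # the one row per day at 01:00
s.index = s.index.date                   # re-key by calendar date
df[:] = s.reindex(df.index.date).values  # broadcast each day's row to all hours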
I have 4 portfolios a, b, c, d, each of which can take the value "no" or "own" over a period of time (code included below to facilitate replication):
from datetime import datetime
import pandas as pd

ano = ('a', 'no', datetime(2018, 1, 1), datetime(2018, 1, 2))
aown = ('a', 'own', datetime(2018, 1, 3), datetime(2018, 1, 4))
bno = ('b', 'no', datetime(2018, 1, 1), datetime(2018, 1, 5))
bown = ('b', 'own', datetime(2018, 1, 6), datetime(2018, 1, 7))
cown = ('c', 'own', datetime(2018, 1, 9), datetime(2018, 1, 10))
down = ('d', 'own', datetime(2018, 1, 9), datetime(2018, 1, 9))
sch = pd.DataFrame([ano, aown, bno, bown, cown, down],
                   columns=['portf', 'base', 'st', 'end'])
Summary of schedule:
portf base st end
0 a no 2018-01-01 2018-01-02
1 a own 2018-01-03 2018-01-04
2 b no 2018-01-01 2018-01-05
3 b own 2018-01-06 2018-01-07
4 c own 2018-01-09 2018-01-10
5 d own 2018-01-09 2018-01-09
What I have tried: creating a holding dataframe and filling in values based on the schedule. Unfortunately, the first portfolio 'a' gets overwritten:
df = pd.DataFrame(index=pd.date_range(min(sch.st), max(sch.end)),
                  columns=['portf', 'base'])
for row in range(len(sch)):
    df.loc[sch['st'][row]:sch['end'][row], ['portf', 'base']] = sch.loc[row, ['portf', 'base']].values
portf base
2018-01-01 b no
2018-01-02 b no
2018-01-03 b no
2018-01-04 b no
2018-01-05 b no
2018-01-06 b own
2018-01-07 b own
2018-01-08 NaN NaN
2018-01-09 d own
2018-01-10 c own
desired output:
2018-01-01 (('a','no'), ('b','no'))
2018-01-02 (('a','no'), ('b','no'))
2018-01-03 (('a','own'), ('b','no'))
2018-01-04 (('a','own'), ('b','no'))
2018-01-05 ('b','no')
...
I am sure there's an easier way of achieving this, but it's probably a pattern I haven't encountered before. Many thanks in advance!
I would organize the data differently: the index is the date, the columns are portf, and the values are base.
First we need to reshape the data and resample to daily fields. Then it's a simple pivot.
cols = ['portf', 'base']
s = (sch.reset_index()
        .melt(cols + ['index'], value_name='date')
        .set_index('date')
        .groupby(cols + ['index'], group_keys=False)
        .resample('D').ffill()
        .drop(columns=['variable', 'index'])
        .reset_index())
res = s.pivot(index='date', columns='portf')
res = res.resample('D').first() # Recover missing dates between
Output of res:
base
portf a b c d
2018-01-01 no no NaN NaN
2018-01-02 no no NaN NaN
2018-01-03 own no NaN NaN
2018-01-04 own no NaN NaN
2018-01-05 NaN no NaN NaN
2018-01-06 NaN own NaN NaN
2018-01-07 NaN own NaN NaN
2018-01-08 NaN NaN NaN NaN
2018-01-09 NaN NaN own own
2018-01-10 NaN NaN own NaN
If you need your other output, we can get there with some less-than-ideal Series.apply calls. This will be very slow on a large DataFrame; I would seriously consider keeping the format above.
s.set_index('date').apply(tuple, axis=1).groupby('date').apply(tuple)
date
2018-01-01 ((a, no), (b, no))
2018-01-02 ((a, no), (b, no))
2018-01-03 ((a, own), (b, no))
2018-01-04 ((a, own), (b, no))
2018-01-05 ((b, no),)
2018-01-06 ((b, own),)
2018-01-07 ((b, own),)
2018-01-09 ((c, own), (d, own))
2018-01-10 ((c, own),)
dtype: object
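If the tuple format is truly needed, here is a sketch of a slightly cheaper variant that avoids the row-wise apply by zipping the two columns first (reusing s from above; the result is still object dtype, so the wide table above remains preferable):
# build (portf, base) pairs per row, then collect each date's pairs into a tuple
pairs = pd.Series(list(zip(s['portf'], s['base'])), index=s['date'])
pairs.groupby(level=0).agg(tuple)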
I have a dataframe (named df) sorted by identifier, id_number, and contract_year_month, like this so far:
identifier id_number contract_year_month collection_year_month
K001 1 2018-01-03 2018-01-09
K001 1 2018-01-08 2018-01-10
K001 2 2018-01-01 2018-01-05
K001 2 2018-01-15 2018-01-18
K002 4 2018-01-04 2018-01-07
K002 4 2018-01-09 2018-01-15
and I would like to add a column named date_difference that holds contract_year_month minus the previous row's collection_year_month, within each identifier and id_number group (e.g. 2018-01-08 minus 2018-01-09),
so that the df would be:
identifier id_number contract_year_month collection_year_month date_difference
K001 1 2018-01-03 2018-01-09
K001 1 2018-01-08 2018-01-10 -1
K001 2 2018-01-01 2018-01-05
K001 2 2018-01-15 2018-01-18 10
K002 4 2018-01-04 2018-01-07
K002 4 2018-01-09 2018-01-15 2
I already converted the contract_year_month and collection_year_month columns to datetime, and tried a simple shift or iloc, but neither works:
df["date_difference"] = df.groupby(["identifier", "id_number"])["contract_year_month"]
Is there any way to use groupby to get the difference between the current row's value and the previous row's value in another column, within the two identifier columns? (I've searched for an hour but couldn't find a hint.) I would sincerely appreciate any advice.
Here is one potential way to do this.
First create a boolean mask, then use numpy.where and Series.shift to create the column date_difference:
mask = df.duplicated(['identifier', 'id_number'])
df['date_difference'] = np.where(
    mask,
    (df['contract_year_month'] - df['collection_year_month'].shift(1)).dt.days,
    np.nan)
[output]
identifier id_number contract_year_month collection_year_month date_difference
0 K001 1 2018-01-03 2018-01-09 NaN
1 K001 1 2018-01-08 2018-01-10 -1.0
2 K001 2 2018-01-01 2018-01-05 NaN
3 K001 2 2018-01-15 2018-01-18 10.0
4 K002 4 2018-01-04 2018-01-07 NaN
5 K002 4 2018-01-09 2018-01-15 2.0
Here's one approach using your groupby() (updated based on feedback from @piRSquared):
In []:
(df['contract_year_month']
 - df['collection_year_month']
     .groupby([df['identifier'], df['id_number']])
     .shift()).dt.days
Out[]:
0 NaN
1 -1.0
2 NaN
3 10.0
4 NaN
5 2.0
dtype: float64
You can just assign this to df['date_difference']:
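Put together as an assignment (note the order of the subtraction: contract minus the shifted collection):
df['date_difference'] = (df['contract_year_month']
                         - df['collection_year_month']
                             .groupby([df['identifier'], df['id_number']])
                             .shift()).dt.days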
I am trying to sum the values of colA over a date range based on the "date" column, and store this rolling value in the new column "sum_col".
But I am getting the sum of all rows (=100), not just those in the date range.
I can't use rolling or groupby, as the dates in my real data are not sequential (some days are missing).
Any idea how to do this? Thanks.
# Create data frame
df = pd.DataFrame()
# Create datetimes and data
df['date'] = pd.date_range('1/1/2018', periods=100, freq='D')
df['colA'] = 1
df['colB'] = 2
df['colC'] = 3

StartDate = df.date - pd.to_timedelta(5, unit='D')
EndDate = df.date
dfx = df
dfx['StartDate'] = StartDate
dfx['EndDate'] = EndDate
dfx['sum_col'] = df[(df['date'] > StartDate) & (df['date'] <= EndDate)].sum()['colA']
dfx.head(50)
I'm not sure whether you want 3 columns for the sums of colA, colB, and colC respectively, or one column summing all three, but here is an example of how you would sum the values for colA:
dfx['colAsum'] = dfx.apply(lambda x: df.loc[(df.date >= x.StartDate) &
                                            (df.date <= x.EndDate), 'colA'].sum(), axis=1)
e.g. (with periods=10):
date colA colB colC StartDate EndDate colAsum
0 2018-01-01 1 2 3 2017-12-27 2018-01-01 1
1 2018-01-02 1 2 3 2017-12-28 2018-01-02 2
2 2018-01-03 1 2 3 2017-12-29 2018-01-03 3
3 2018-01-04 1 2 3 2017-12-30 2018-01-04 4
4 2018-01-05 1 2 3 2017-12-31 2018-01-05 5
5 2018-01-06 1 2 3 2018-01-01 2018-01-06 6
6 2018-01-07 1 2 3 2018-01-02 2018-01-07 6
7 2018-01-08 1 2 3 2018-01-03 2018-01-08 6
8 2018-01-09 1 2 3 2018-01-04 2018-01-09 6
9 2018-01-10 1 2 3 2018-01-05 2018-01-10 6
If what I understand is correct:
for i in range(df.shape[0]):
    dfx.loc[i, 'sum_col'] = df[(df['date'] > StartDate[i]) & (df['date'] <= EndDate[i])].sum()['colA']
Note the window is half-open, (StartDate, EndDate]: for 2018-01-06 it covers 2018-01-02 through 2018-01-06, five rows, hence the sum 5.0.
date colA colB colC StartDate EndDate sum_col
0 2018-01-01 1 2 3 2017-12-27 2018-01-01 1.0
1 2018-01-02 1 2 3 2017-12-28 2018-01-02 2.0
2 2018-01-03 1 2 3 2017-12-29 2018-01-03 3.0
3 2018-01-04 1 2 3 2017-12-30 2018-01-04 4.0
4 2018-01-05 1 2 3 2017-12-31 2018-01-05 5.0
5 2018-01-06 1 2 3 2018-01-01 2018-01-06 5.0
6 2018-01-07 1 2 3 2018-01-02 2018-01-07 5.0
7 2018-01-08 1 2 3 2018-01-03 2018-01-08 5.0
8 2018-01-09 1 2 3 2018-01-04 2018-01-09 5.0
9 2018-01-10 1 2 3 2018-01-05 2018-01-10 5.0
10 2018-01-11 1 2 3 2018-01-06 2018-01-11 5.0
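As an aside, pandas' offset-based rolling window is keyed on the timestamps themselves rather than on row counts, so it tolerates missing days; a sketch, assuming date is sorted as in the example (an offset window is right-closed by default, matching the (StartDate, EndDate] logic above):
# '5D' sizes the window by the 'date' column, so calendar gaps
# simply contribute fewer rows to each window
dfx['sum_col'] = df.rolling('5D', on='date')['colA'].sum()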
How can I find and remove rows from a DataFrame with values in a specific range, for example dates greater than '2017-03-02' and smaller than '2017-03-05'?
import pandas as pd
d_index = pd.date_range('2018-01-01', '2018-01-06')
d_values = pd.date_range('2017-03-01', '2017-03-06')
s = pd.Series(d_values)
s = s.rename('values')
df = pd.DataFrame(s)
df = df.set_index(d_index)
# remove rows with specific values in 'value' column
In the example above, d_values is ordered from the earliest to the latest date, so slicing the dataframe by index would do the job. But I am looking for a solution that also works when d_values contains unordered, random date values. Is there any way to do it in pandas?
Option 1
pd.Series.between seems suited for this task (on modern pandas, pass inclusive='neither'; older versions used inclusive=False):
df[~df['values'].between('2017-03-02', '2017-03-05', inclusive='neither')]
values
2018-01-01 2017-03-01
2018-01-02 2017-03-02
2018-01-05 2017-03-05
2018-01-06 2017-03-06
Details
between identifies all items within the range -
m = df['values'].between('2017-03-02', '2017-03-05', inclusive='neither')
m
2018-01-01 False
2018-01-02 False
2018-01-03 True
2018-01-04 True
2018-01-05 False
2018-01-06 False
Freq: D, Name: values, dtype: bool
Use the mask to filter on df -
df = df[~m]
Option 2
Alternatively, with the good ol' logical operators -
df[~(df['values'].gt('2017-03-02') & df['values'].lt('2017-03-05'))]
values
2018-01-01 2017-03-01
2018-01-02 2017-03-02
2018-01-05 2017-03-05
2018-01-06 2017-03-06
Note that both options work with datetime objects as well as string date columns (in which case, the comparison is lexicographic).
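If the column might hold strings rather than real datetimes, a small sketch of the safer route is to parse first, so the comparison is chronological rather than lexicographic:
# parse to timestamps once, then compare as dates
vals = pd.to_datetime(df['values'])
df = df[~vals.between('2017-03-02', '2017-03-05', inclusive='neither')]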
First, let's shuffle your DF:
In [65]: df = df.sample(frac=1)
In [66]: df
Out[66]:
values
2018-01-03 2017-03-03
2018-01-04 2017-03-04
2018-01-01 2017-03-01
2018-01-06 2017-03-06
2018-01-05 2017-03-05
2018-01-02 2017-03-02
You can use the DataFrame.eval method (thanks @cᴏʟᴅsᴘᴇᴇᴅ for the correction!):
In [70]: df[~df.eval("'2017-03-02' < values < '2017-03-05'")]
Out[70]:
values
2018-01-01 2017-03-01
2018-01-06 2017-03-06
2018-01-05 2017-03-05
2018-01-02 2017-03-02
or DataFrame.query():
In [300]: df.query("not ('2017-03-02' < values < '2017-03-05')")
Out[300]:
values
2018-01-01 2017-03-01
2018-01-06 2017-03-06
2018-01-05 2017-03-05
2018-01-02 2017-03-02