From the picture below, we can see that serial C failed on 3rd January and serial A failed on 5th January, within a 6-day period. I want to take samples from the 3 days before the failure of each serial number.
My code:
import pandas as pd
import numpy as np
import datetime
from datetime import date, timedelta
df = pd.read_csv('https://gist.githubusercontent.com/JishanAhmed2019/e464ca4da5c871428ca9ed9264467aa0/raw/da3921c1953fefbc66dddc3ce238dac53142dba8/failure.csv',sep='\t')
df['date'] = pd.to_datetime(df['date'])
#df.drop(columns=df.columns[0], axis=1, inplace=True)
df = df.sort_values(by="date")
d = datetime.timedelta(days = 3)
df_fail_date = df[df['failure']==1].groupby(['serial_number'])['date'].min()
df_fail_date = df_fail_date - d
df_fail_date
I was not able to move further and sample my data. I want to get the following data, i.e. the 3 days before each failure. Serial C had only 1 day available before its failure, so I want to keep that one as well. It would be nice to add a duration column counting the days before the failure occurred. I appreciate your suggestions. Thanks!
Expected output dataframe:
You can use a groupby.rolling to get the dates/serials to keep, then merge to select:
df['date'] = pd.to_datetime(df['date'])
N = 3
# sort by date, reverse, and run a rolling window on the reversed frame:
# chronologically this looks N days ahead, so a row is flagged (True) when
# a failure occurs within the next N days (failure day included)
m = (df.sort_values(by='date')
       .loc[::-1]
       .groupby('serial_number', group_keys=False)
       .rolling(f'{N+1}d', on='date')
       ['failure'].max().eq(1)
       .iloc[::-1]
     )
out = df.merge(m[m], left_on=['serial_number', 'date'],
               right_index=True, how='right')
Output:
date serial_number failure_x smart_5_raw smart_187_raw failure_y
2 2014-01-01 C 0 0 80 True
8 2014-01-02 C 0 0 200 True
4 2014-01-03 C 1 0 120 True
7 2014-01-02 A 0 0 180 True
5 2014-01-03 A 0 0 140 True
9 2014-01-04 A 0 0 280 True
14 2014-01-05 A 1 0 400 True
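The merge keeps the boolean mask as failure_y. If you only want the original columns back, a small cleanup sketch (continuing from the out above):
out = (out.drop(columns='failure_y')
          .rename(columns={'failure_x': 'failure'}))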
Another possible solution:
N = 4
df['date'] = pd.to_datetime(df['date'])

(df[df.groupby('serial_number')['failure'].transform('sum') == 1]  # serials with exactly one failure
 .sort_values(by=['serial_number', 'date'])
 .groupby('serial_number')
 # number the last min(N, len(g)) rows of each group 1..N; earlier rows
 # get non-positive values and are dropped below
 .apply(lambda g: g.assign(
     duration=1 + np.arange(min(0, min(N, len(g)) - len(g)), min(N, len(g)))))
 .loc[lambda x: x['duration'] > 0]
 .reset_index(drop=True))
Output:
date serial_number failure smart_5_raw smart_187_raw duration
0 2014-01-02 A 0 0 180 1
1 2014-01-03 A 0 0 140 2
2 2014-01-04 A 0 0 280 3
3 2014-01-05 A 1 0 400 4
4 2014-01-01 C 0 0 80 1
5 2014-01-02 C 0 0 200 2
6 2014-01-03 C 1 0 120 3
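For the duration column specifically, here is a shorter sketch of the same idea, assuming (as in the filtered data above) that each serial's rows end at its failure date: keep the last N rows per group with tail and number them with cumcount.
N = 4
out = (df.sort_values(['serial_number', 'date'])
         .groupby('serial_number')
         .tail(N)  # keep at most the last N rows per serial
         .assign(duration=lambda d: d.groupby('serial_number').cumcount() + 1)
         .reset_index(drop=True))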
Related
I would like to extract features from a datetime column. For example, for days 1 to 10 of the month, the output stored in a column called early_month should be 1, and 0 otherwise.
The following question I posted earlier gave me a solution using indexer_between_time in order to use time ranges.
How to extract features using time range?
I am using the following code to extract days of the month from date.
df["date_of_month"] = df["purchase_date"].dt.day
Thank you.
It's not clear from your question, but if you are trying to create a column that contains a 1 if the day is between 1 and 10, or 0 otherwise, it's very simple:
df['early_month'] = df['date_of_month'].apply(lambda x: 1 if x <= 10 else 0)
df['mid_month'] = df['date_of_month'].apply(lambda x: 1 if x >= 11 and x <= 20 else 0)
If, as a Python beginner, you would rather avoid lambda functions, you could achieve the same result by defining a named function and then applying it like so:
def create_date_features(day, min_day, max_day):
    if day >= min_day and day <= max_day:
        return 1
    else:
        return 0
df['early_month'] = df['date_of_month'].apply(create_date_features, min_day=1, max_day=10)
df['mid_month'] = df['date_of_month'].apply(create_date_features, min_day=11, max_day=20)
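A more vectorized variant (a sketch over the same columns) uses Series.between, whose boolean result casts straight to 0/1:
df['early_month'] = df['date_of_month'].between(1, 10).astype(int)
df['mid_month'] = df['date_of_month'].between(11, 20).astype(int)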
I believe you need to convert the boolean mask to integers, since True values are treated as 1s:
rng = pd.date_range('2017-04-03', periods=10, freq='17D')
df = pd.DataFrame({'purchase_date': rng, 'a': range(10)})
m2 = df["purchase_date"].dt.day <= 10
df['early_month'] = m2.astype(int)
print (df)
purchase_date a early_month
0 2017-04-03 0 1
1 2017-04-20 1 0
2 2017-05-07 2 1
3 2017-05-24 3 0
4 2017-06-10 4 1
5 2017-06-27 5 0
6 2017-07-14 6 0
7 2017-07-31 7 0
8 2017-08-17 8 0
9 2017-09-03 9 1
Detail:
print (df["purchase_date"].dt.day <= 10)
0 True
1 False
2 True
3 False
4 True
5 False
6 False
7 False
8 False
9 True
Name: purchase_date, dtype: bool
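Equivalently, a sketch with numpy.where (assuming numpy is imported as np):
import numpy as np
df['early_month'] = np.where(df['purchase_date'].dt.day <= 10, 1, 0)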
Maybe you need this one:
import pandas as pd
from datetime import datetime
df = pd.DataFrame({'a':[1,2,3,4,5], 'time':['11.07.2018','12.07.2018','13.07.2018','14.07.2018','15.07.2018']})
df.time = pd.to_datetime(df.time, format='%d.%m.%Y')
df[df.time > datetime(2018,7,13)]  # if you need to filter by date
df[df.time.dt.day > 13]  # if you need to filter by day of the month
I have a list of dates and alphabets. I have to find the count of each alphabet occurring within a week. I'm trying to group them by alphabet and resample by '1w', but I get a weird data frame that contains a MultiIndex. How can I do all this and get a DataFrame with just three columns: the score, the new resampled date, and the count?
PS: What I'm looking for is, for each week, the count of occurrences of every alphabet in that week.
Something like this:
datetime alphabet count
2016-12-27 22:57:45.407246 a 1
2016-12-30 22:57:45.407246 a 2
2017-01-02 22:57:45.407246 a 0
2016-12-27 22:57:45.407246 b 0
2016-12-30 22:57:45.407246 b 1
2017-01-02 22:57:45.407246 b 4
2016-12-27 22:57:45.407246 c 7
2016-12-30 22:57:45.407246 c 0
2017-01-02 22:57:45.407246 c 0
Here is the code:
import random
import pandas as pd
import datetime
def randchar(a, b):
    return chr(random.randint(ord(a), ord(b)))
# Create a datetime variable for today
base = datetime.datetime.today()
# Create a list variable that creates 365 days of rows of datetime values
date_list = [base - datetime.timedelta(days=x) for x in range(0, 365)]
score_list =[randchar('a', 'h') for i in range(365)]
df = pd.DataFrame()
# Create a column from the datetime variable
df['datetime'] = date_list
# Convert that column into a datetime datatype
df['datetime'] = pd.to_datetime(df['datetime'])
# Set the datetime column as the index
df.index = df['datetime']
# Create a column from the score variable
df['score'] = score_list
df_s = df.groupby('score').resample('1w').count()
You can apply a groupby + count with two grouping keys:
- a pd.Grouper with a frequency of one week
- the score column
Finally, unstack the result.
df = df.groupby([pd.Grouper(freq='1w'), 'score']).count().unstack(fill_value=0)
df.head()
datetime
score a b c d e f g h
datetime
2016-12-25 0 0 1 1 0 1 0 1
2017-01-01 1 0 0 1 3 0 2 0
2017-01-08 0 3 1 1 1 0 0 1
2017-01-15 1 2 0 2 0 0 1 1
2017-01-22 0 1 2 1 1 2 0 0
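If you prefer the long, three-column layout from the question, a sketch starting from the original df (before the reassignment above); count is an assumed column name:
long_df = (df.groupby([pd.Grouper(freq='1w'), 'score'])
             .size()                 # rows per (week, score) pair
             .unstack(fill_value=0)  # make weeks with no occurrences explicit zeros
             .stack()                # back to long form
             .rename('count')
             .reset_index())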
I have a pandas dataframe something like this:
Date ID
01/01/2016 a
05/01/2016 a
10/05/2017 a
05/05/2014 b
07/09/2014 b
12/08/2017 b
What I need to do is add a column showing the number of entries for each ID that occurred within the last year, and another column showing the number within the next year. I've written some horrible code that iterates through the whole dataframe (millions of lines) and does the computations, but there must be a better way!
I think you need between with boolean indexing to filter first, and then groupby with the size aggregation.
The outputs are concatenated, and reindex adds the missing rows, filled with 0:
print (df)
Date ID
0 01/01/2016 a
1 05/01/2016 a
2 10/05/2017 a
3 05/05/2018 b
4 07/09/2014 b
5 07/09/2014 c
6 12/08/2018 b
# convert to datetime (if the first number is the day, add the dayfirst parameter)
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
now = pd.Timestamp.today()
print (now)
one_year_before = now - pd.offsets.DateOffset(years=1)
one_year_after = now + pd.offsets.DateOffset(years=1)
#first filter
a = df[df['Date'].between(one_year_before, now)].groupby('ID').size()
b = df[df['Date'].between(now, one_year_after)].groupby('ID').size()
print (a)
ID
a 1
dtype: int64
print (b)
ID
b 2
dtype: int64
df1 = pd.concat([a,b],axis=1).fillna(0).astype(int).reindex(df['ID'].unique(),fill_value=0)
print (df1)
0 1
a 1 0
b 0 2
c 0 0
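To get descriptive column names instead of 0 and 1, the same concat can take keys (the names here are assumed):
df1 = (pd.concat([a, b], axis=1, keys=['last_year', 'next_year'])
         .fillna(0).astype(int)
         .reindex(df['ID'].unique(), fill_value=0))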
EDIT:
If you need to compare each date against the last date per group, plus or minus a one-year offset, you need a custom function that applies the condition and sums the True values:
offs = pd.offsets.DateOffset(years=1)
f = lambda x: pd.Series([(x > x.iat[-1] - offs).sum(),
                         (x < x.iat[-1] + offs).sum()], index=['last', 'next'])
df = df.groupby('ID')['Date'].apply(f).unstack(fill_value=0).reset_index()
print (df)
ID last next
0 a 1 3
1 b 3 2
2 c 1 1
In [19]: x['date'] = pd.to_datetime( x['date']) # convert string date to datetime pd object
In [20]: x['date'] = x['date'].dt.year # get year from the date
In [21]: x
Out[21]:
date id
0 2016 a
1 2016 a
2 2017 a
3 2014 b
4 2014 b
5 2017 b
In [27]: x.groupby(['date','id']).size() # group by both columns
Out[27]:
date id
2014 b 2
2016 a 2
2017 a 1
b 1
Using resample takes care of missing in-between years; see year 2015.
In [550]: df.set_index('Date').groupby('ID').resample('Y').size().unstack(fill_value=0)
Out[550]:
Date 2014-12-31 2015-12-31 2016-12-31 2017-12-31
ID
a 0 0 2 1
b 2 0 0 1
Use rename if you want only year in columns
In [551]: (df.set_index('Date').groupby('ID').resample('Y').size().unstack(fill_value=0)
.rename(columns=lambda x: x.year))
Out[551]:
Date 2014 2015 2016 2017
ID
a 0 0 2 1
b 2 0 0 1
I'm trying to calculate a rolling aggregate rate for a time series.
The way to think about the data is that it is the results of a bunch of multi-game series against different teams. We don't know who wins a series until its last game. I'm trying to calculate the win rate as it evolves against each of the opposing teams.
series_id date opposing_team won_series
1 1/1/2000 a 0
1 1/3/2000 a 0
1 1/5/2000 a 1
2 1/4/2000 a 0
2 1/7/2000 a 0
2 1/9/2000 a 0
3 1/6/2000 b 0
Becomes:
series_id date opposing_team won_series percent_win_against_team
1 1/1/2000 a 0 NA
1 1/3/2000 a 0 NA
1 1/5/2000 a 1 100
2 1/4/2000 a 0 NA
2 1/7/2000 a 0 100
2 1/9/2000 a 0 50
3 1/6/2000 b 0 0
I still don't feel like I understand the rule for how you decide when a series is over. Is series 3 over? Why is it NA? I would have thought 1/3rd. Still, here is a way to keep track of the number of completed series and a win rate.
Define 26472215table.csv:
series_id,date,opposing_team,won_series
1,1/1/2000,a,0
1,1/3/2000,a,0
1,1/5/2000,a,1
2,1/4/2000,a,0
2,1/7/2000,a,0
2,1/9/2000,a,0
3,1/6/2000,b,0
Code:
import pandas as pd
import numpy as np
df = pd.read_csv('26472215table.csv')
grp2 = df.groupby(['series_id'])
sr = grp2['date'].max()
sr.name = 'LastGame'
df2 = df.join( sr, on=['series_id'], how='left')
df2 = df2.sort_values('date')
df2['series_comp'] = df2['date'] == df2['LastGame']
df2['running_sr_cnt'] = df2.groupby(['opposing_team'])['series_comp'].cumsum()
df2['running_win_cnt'] = df2.groupby(['opposing_team'])['won_series'].cumsum()
winrate = lambda x: x['running_win_cnt'] / x['running_sr_cnt'] if x['running_sr_cnt'] > 0 else None
df2['winrate'] = df2[['running_sr_cnt', 'running_win_cnt']].apply(winrate, axis=1)
Results df2[['date', 'winrate']]:
       date  winrate
0  1/1/2000      NaN
1  1/3/2000      NaN
3  1/4/2000      NaN
2  1/5/2000      1.0
6  1/6/2000      0.0
4  1/7/2000      1.0
5  1/9/2000      0.5
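A more vectorized sketch of the same idea (assuming the CSV from above): flag the last game of each series with transform('max') and divide the running counts directly.
df2 = df.copy()
df2['date'] = pd.to_datetime(df2['date'])  # parse so date comparisons are safe
df2 = df2.sort_values('date')
# a game is the last of its series when its date equals the series' max date
df2['series_comp'] = df2['date'].eq(df2.groupby('series_id')['date'].transform('max'))
df2['running_sr_cnt'] = df2.groupby('opposing_team')['series_comp'].cumsum()
df2['running_win_cnt'] = df2.groupby('opposing_team')['won_series'].cumsum()
# NaN until at least one series against that team has completed
df2['winrate'] = (df2['running_win_cnt'] / df2['running_sr_cnt']
                  ).where(df2['running_sr_cnt'] > 0)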
I have a strange problem with np.where. I first load a dataset called df and create a duplicate of it, df_1. I then use np.where to make each value in df_1 equal to 1 if the number in the cell is greater than or equal to its column mean (found in the DataFrame df_mean), else 0. I use a for loop to iterate over the column headers of df_1 and the list of column means in df_mean. Here's my code:
#Load the data
df = pd.read_csv('F:\\file.csv')
df.head(2)
>>> A AA AAP AAPL ABC
2011-01-10 09:30:00 -0.000546 0.006528 -0.001051 0.034593 -0.000095 ...
2011-01-10 09:30:10 -0.000256 0.007705 -0.001134 0.008578 -0.000549 ...
# Show the Series of column averages
>>> df_mean.head(4)
A 0.000656
AA 0.002068
AAP 0.001134
AAPL 0.001728
...
df_1 = df
for x in df_1.columns:
    df_1[x] = np.where(df_1[x] >= df_mean[x], 1, 0)
>>> df_1.head(4) #Which is my desired output (but which also makes df = df_1...WHY?)
A AA AAP AAPL ABC
2011-01-10 09:30:00 0 1 0 1 0 ...
2011-01-10 09:30:10 0 1 0 1 0 ...
2011-01-10 09:30:20 0 0 0 1 0 ...
2011-01-10 09:30:30 0 0 0 1 1 ...
Now, I get what I want, which is a binary 1/0 matrix for df_1, but it turns out that df also becomes a binary matrix (the same as df_1). WHY? The loop does not even touch df...
Although this is not what you asked for, my spidey sense tells me you want some form of indicator of whether a stock is currently over- or under-performing with regard to "something", using the mean of that "something". Maybe try this:
S = pd.DataFrame(
    np.array([[1.2, 3.4], [1.1, 3.5], [1.4, 3.3], [1.2, 1.6]]),
    columns=["Stock A", "Stock B"],
    index=pd.date_range("2014-01-01", "2014-01-04", freq="D")
)
indicator = S > S.mean()
binary = indicator.astype("int")
print(S)
print(indicator)
print(binary)
This gives the output:
Stock A Stock B
2014-01-01 1.2 3.4
2014-01-02 1.1 3.5
2014-01-03 1.4 3.3
2014-01-04 1.2 1.6
[4 rows x 2 columns]
Stock A Stock B
2014-01-01 False True
2014-01-02 False True
2014-01-03 True True
2014-01-04 False False
[4 rows x 2 columns]
Stock A Stock B
2014-01-01 0 1
2014-01-02 0 1
2014-01-03 1 1
2014-01-04 0 0
[4 rows x 2 columns]
While you are at it, you should probably look into a rolling mean, e.g. S.rolling(n_periods_for_mean).mean() (the old pd.rolling_mean was removed in later pandas versions).
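As for the original question of why df changed: df_1 = df does not copy anything, it just binds a second name to the same object, so the column assignments show through both names. A minimal sketch of the fix, assuming df and df_mean as in the question:
df_1 = df.copy()  # independent copy; plain df_1 = df only aliases df
for col in df_1.columns:
    # 1 where the cell is >= its column mean, else 0; df stays untouched
    df_1[col] = np.where(df_1[col] >= df_mean[col], 1, 0)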