Getting the years since an event - python

I am working on a dataset with pandas in which maintenance work is done at a location. The maintenance is done at random intervals, sometimes every year, sometimes never. I want to find the years since the last maintenance action at each site, provided an action has been made on that site. There can be more than one action for a site and the occurrences of actions are random. For the years prior to the first action, it is not possible to know the years since the last action because that information is not in the dataset.
I give only two sites in the following example but in the original dataset, I have thousands of them. My data only covers the years 2014 through 2017.
Action = 0 means no action has been performed that year, Action = 1 means some action has been done. Measurement is a performance reading related to the effect of the action. The action can happen in any year.
Site Year Action Measurement
A 2014 1 100
A 2015 0 150
A 2016 0 300
A 2017 0 80
B 2014 0 200
B 2015 1 250
B 2016 1 60
B 2017 0 110
Given this dataset, I want to produce a dataset like this:
Item Year Action Measurement Years_Since_Last_Action
A 2014 1 100 1
A 2015 0 150 2
A 2016 0 300 3
A 2017 0 80 4
B 2015 1 250 1
B 2016 1 60 1
B 2017 0 110 2
Please observe that the year 2014 is filtered out for Site B because that year is prior to the first action for that site.
Many thanks in advance!

I wrote the code myself. It is messy but does the job for me. :)
The solution assumes that df_select has an integer index.
# keep only sites that have at least one action
df_select = df_select[df_select['Site'].map(df_select.groupby('Site')['Action'].max() == 1)]
years_since_action = pd.Series(dtype='int64')
gbo = df_select.groupby('Site')
for key, group in gbo:
    indices_with_ones = group[group['Action'] == 1].index
    indices = group.index
    # default to 0; rows with an action get 1
    group['Years_since_action'] = 0
    group.loc[indices_with_ones, 'Years_since_action'] = 1
    # for every row after an action, count the years elapsed since that action
    for idx_with_ones in indices_with_ones.sort_values(ascending=False):
        for idx in indices:
            if group.loc[idx, 'Years_since_action'] == 0:
                if idx > idx_with_ones:
                    group.loc[idx, 'Years_since_action'] = idx - idx_with_ones + 1
    # Series.append was removed in pandas 2.0, so collect the groups with pd.concat
    years_since_action = pd.concat([years_since_action, group['Years_since_action']])
df_final = pd.merge(df_select, pd.DataFrame(years_since_action), how='left',
                    left_index=True, right_index=True)

Here is how I would approach it:
import pandas as pd
from io import StringIO
import numpy as np
s = '''Site Year Action Measurement
A 2014 1 100
A 2015 0 150
A 2016 0 300
A 2017 0 80
B 2014 0 200
B 2015 1 250
B 2016 1 60
B 2017 0 110
'''
ss = StringIO(s)
df = pd.read_csv(ss, sep=r"\s+")
df_maintain = df[df.Action==1][['Site', 'Year']]
df_maintain.reset_index(drop=True, inplace=True)
df_maintain
def find_last_maintenance(x):
    df_temp = df_maintain[x.Site == df_maintain.Site]
    gap = [0]
    for ind, row in df_temp.iterrows():
        if x.Year >= row['Year']:
            gap.append(x.Year - row['Year'] + 1)
    return gap[-1]
df['Gap'] = df.apply(find_last_maintenance, axis=1)
df = df[df.Gap !=0]
This generates the desired output.
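For reference, the same result can usually be obtained without explicit loops. This is only a sketch of a vectorized alternative, assuming the df parsed above (columns Site, Year, Action, Measurement) and at most one row per site and year:
df = df.sort_values(['Site', 'Year'])
# year of the most recent action, forward-filled within each site
last_action_year = df['Year'].where(df['Action'] == 1)
last_action_year = last_action_year.groupby(df['Site']).ffill()
df['Years_Since_Last_Action'] = df['Year'] - last_action_year + 1
# rows before a site's first action have no reference year, so drop them
df = df.dropna(subset=['Years_Since_Last_Action'])
df['Years_Since_Last_Action'] = df['Years_Since_Last_Action'].astype(int)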

Compare two dataframes column values. Find which values are in one df and not the other

I have the following dataset
df=pd.read_csv('https://raw.githubusercontent.com/michalis0/DataMining_and_MachineLearning/master/data/sales.csv')
df["OrderYear"] = pd.DatetimeIndex(df['Order Date']).year
I want to compare the customers in 2017 and 2018 and see if the store has lost customers.
I did two subsets corresponding to 2017 and 2018 :
Customer_2018 = df.loc[(df.OrderYear == 2018)]
Customer_2017 = df.loc[(df.OrderYear == 2017)]
I then tried this to compare the two:
Churn = Customer_2017['Customer ID'].isin(Customer_2018['Customer ID']).value_counts()
Churn
And I get the following output:
True 2206
False 324
Name: Customer ID, dtype: int64
The problem is some customers may appear several times in the dataset since they made several orders.
I would like to get only unique customers (Customer ID is the only unique attribute) and then compare the two dataframes to see how many customers the store lost between 2017 and 2018.
To go further in the analysis, you can use pd.crosstab:
out = pd.crosstab(df['Customer ID'], df['OrderYear'])
At this point your dataframe looks like:
>>> out
OrderYear 2015 2016 2017 2018
Customer ID
AA-10315 4 1 4 2
AA-10375 2 4 4 5
AA-10480 1 0 10 1
AA-10645 6 3 8 1
AB-10015 4 0 2 0 # <- lost customer
... ... ... ... ...
XP-21865 10 3 9 6
YC-21895 3 1 3 1
YS-21880 0 5 0 7
ZC-21910 5 9 9 8
ZD-21925 3 0 5 1
Values are the number of orders per customer and year.
Now it's easy to get "lost customers":
>>> sum((out[2017] != 0) & (out[2018] == 0))
83
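If the customer IDs themselves are wanted rather than just the count, the same mask can index the crosstab (a small sketch along the same lines):
lost_ids = out.index[(out[2017] != 0) & (out[2018] == 0)]
print(len(lost_ids))  # 83, the same count as above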
If only one comparison is required, I would use python sets:
c2017 = set(Customer_2017['Customer ID'])
c2018 = set(Customer_2018['Customer ID'])
print(f'lost customers between 2017 and 2018: {len(c2017 - c2018)}')
print(f'customers from 2017 remaining in 2018: {len(c2017 & c2018)}')
print(f'new customers in 2018: {len(c2018 - c2017)}')
output:
lost customers between 2017 and 2018: 83
customers from 2017 remaining in 2018: 552
new customers in 2018: 138
Building on the crosstab suggestion from Corralien:
out = pd.crosstab(df['Customer ID'], df['OrderYear'])
(out.gt(0).astype(int).diff(axis=1)
.replace({0: 'remained', 1: 'new', -1: 'lost'})
.apply(pd.Series.value_counts)
)
output:
OrderYear 2015 2016 2017 2018
lost NaN 163 123 83
new NaN 141 191 138
remained NaN 489 479 572
You could just use normal sets to get unique customer ids for each year and then subtract them appropriately:
set_lost_cust = set(Customer_2017["Customer ID"]) - set(Customer_2018["Customer ID"])
len(set_lost_cust)
Out: 83
For your original approach to work you would need to drop the duplicates from the DataFrames, to make sure each customer appears only a single time:
Customer_2018 = df.loc[(df.OrderYear == 2018), "Customer ID"].drop_duplicates()
Customer_2017 = df.loc[(df.OrderYear == 2017), "Customer ID"].drop_duplicates()
Churn = Customer_2017.isin(Customer_2018)
Churn.value_counts()
#Out:
True 552
False 83
Name: Customer ID, dtype: int64

Pandas groupby mean of only positive values

How to get mean of only positive values after groupby in pandas?
MWE:
import numpy as np
import pandas as pd
flights = pd.read_csv('https://github.com/bhishanpdl/Datasets/blob/master/nycflights13.csv?raw=true')
print(flights.shape)
print(flights.iloc[:2,:4])
print()
not_cancelled = flights.dropna(subset=['dep_delay','arr_delay'])
df = (not_cancelled.groupby(['year','month','day'])['arr_delay']
.mean().reset_index()
)
df['avg_delay2'] = df[df.arr_delay>0]['arr_delay'].mean()
print(df.head())
This gives all avg_delay2 values as 16.66.
(336776, 19)
year month day dep_time
0 2013 1 1 517.0
1 2013 1 1 533.0
year month day arr_delay avg_delay2
0 2013 1 1 12.651023 16.665681
1 2013 1 2 12.692888 16.665681
2 2013 1 3 5.733333 16.665681
3 2013 1 4 -1.932819 16.665681
4 2013 1 5 -1.525802 16.665681
Which is WRONG.
# sanity check
a = not_cancelled.query(""" year==2013 & month ==1 & day ==1 """)['arr_delay']
a = a[a>0]
a.mean() # 32.48156182212581
When I do the same thing in R:
library(nycflights13)
not_cancelled = flights %>%
filter( !is.na(dep_delay), !is.na(arr_delay))
df = not_cancelled %>%
group_by(year,month,day) %>%
summarize(
# average delay
avg_delay1 = mean(arr_delay),
# average positive delay
avg_delay2 = mean(arr_delay[arr_delay>0]))
head(df)
It gives correct output for avg_delay2.
year month day avg_delay1 avg_delay2
2013 1 1 12.651023 32.48156
2013 1 2 12.692888 32.02991
2013 1 3 5.733333 27.66087
2013 1 4 -1.932819 28.30976
2013 1 5 -1.525802 22.55882
2013 1 6 4.236429 24.37270
How to do this in Pandas?
I would filter for the positive values before the groupby:
df = (not_cancelled[not_cancelled.arr_delay >0].groupby(['year','month','day'])['arr_delay']
.mean().reset_index()
)
df.head()
because, as in your code, df is a separate dataframe after the groupby operation has completed, and
df['avg_delay2'] = df[df.arr_delay>0]['arr_delay'].mean()
assigns the same scalar value to every row of df['avg_delay2'].
Edit: Similar to R, you can do both in one shot using agg:
def mean_pos(x):
    return x[x>0].mean()
df = (not_cancelled.groupby(['year','month','day'])['arr_delay']
.agg({'arr_delay': 'mean', 'arr_delay_2': mean_pos})
)
df.head()
Note that since pandas 0.23, using a dictionary in groupby agg is deprecated and it will be removed in the future, so we cannot use that method.
Warning
df = (not_cancelled.groupby(['year','month','day'])['arr_delay']
.agg({'arr_delay': 'mean', 'arr_delay_2': mean_pos})
)
FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version.
So, to tackle that problem in this specific case, I came up with another idea:
create a new column where all non-positive values are NaN, then do the usual groupby.
import numpy as np
import pandas as pd
# read data
flights = pd.read_csv('https://github.com/bhishanpdl/Datasets/blob/master/nycflights13.csv?raw=true')
# select flights that are not cancelled
df = flights.dropna(subset=['dep_delay','arr_delay'])
# create new column to fill non-positive with nans
df['arr_delay_pos'] = df['arr_delay']
df.loc[df.arr_delay_pos <= 0,'arr_delay_pos'] = np.nan
df.groupby(['year','month','day'])[['arr_delay','arr_delay_pos']].mean().reset_index().head()
It gives:
year month day arr_delay arr_delay_pos
0 2013 1 1 12.651023 32.481562
1 2013 1 2 12.692888 32.029907
2 2013 1 3 5.733333 27.660870
3 2013 1 4 -1.932819 28.309764
4 2013 1 5 -1.525802 22.558824
Sanity check
# sanity check
a = not_cancelled.query(""" year==2013 & month ==1 & day ==1 """)['arr_delay']
a = a[a>0]
a.mean() # 32.48156182212581
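As a side note, newer pandas versions offer named aggregation, which gives the R-like one-shot result without the deprecated dict syntax. A sketch, assuming pandas 0.25 or later:
def mean_pos(x):
    return x[x > 0].mean()
df = (not_cancelled.groupby(['year', 'month', 'day'])['arr_delay']
      .agg(avg_delay1='mean', avg_delay2=mean_pos)
      .reset_index())
print(df.head())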

Need to calculate time difference between timestamps and store it in a variable

I am a beginner with python, and so my questions could come across as trivial. I would appreciate your support or any leads to my problem.
Problem:
There are about 10 different states; an order moves across different states and a timestamp is generated when the state ends. For example, below there are four states A, B, C, D.
A 10 AM
B 1 PM
C 4 Pm
D 5 PM
Time spent in B = 1 PM - 10 AM = 3 hours.
Sometimes the same state can occur multiple times. Hence, we need a variable to store the time difference value for a single state.
Attached are the raw data CSV and my code so far. There are multiple orders for which this calculation needs to be performed; however, for simplicity, I have data for just one order now.
sample data:
Order States modified_at
1 Resolved 2018-06-18T15:05:52.2460000
1 Edited 2018-05-24T21:44:07.9030000
1 Pending PO Creation 2018-06-06T19:52:51.5990000
1 Assigned 2018-05-24T17:46:03.2090000
1 Edited 2018-06-04T15:02:57.5130000
1 Draft 2018-05-24T17:45:07.9960000
1 PO Placed 2018-06-06T20:49:37.6540000
1 Edited 2018-06-04T11:18:13.9830000
1 Edited 2018-05-24T17:45:39.4680000
1 Pending Approval 2018-05-24T21:48:23.9180000
1 Edited 2018-06-06T21:00:19.6350000
1 Submitted 2018-05-24T21:44:37.8830000
1 Edited 2018-05-30T11:19:36.5460000
1 Edited 2018-05-25T11:16:07.9690000
1 Edited 2018-05-24T21:43:35.0770000
1 Assigned 2018-06-07T18:39:00.2580000
1 Pending Review 2018-05-24T17:45:10.5980000
1 Pending PO Submission 2018-06-06T14:16:26.6580000
Code I tried:
import pandas as pd
import datetime as datetime
from dateutil.relativedelta import relativedelta
fileName = "SamplePR.csv"
df = pd.read_csv(fileName, delimiter=',')
df['modified_at'] = pd.to_datetime(df.modified_at)
df = df.sort_values(by='modified_at')
df = df.reset_index(drop=True)
df1 = df[:-1]
df2 = df[1:]
dfm1 = df1['modified_at']
dfm2 = df2['modified_at']
dfm1 = dfm1.reset_index(drop=True)
dfm2 = dfm2.reset_index(drop=True)
for i in range(len(df)-1):
    start = datetime.datetime.strptime(str(dfm1[i]), '%Y-%m-%d %H:%M:%S')
    ends = datetime.datetime.strptime(str(dfm2[i]), '%Y-%m-%d %H:%M:%S')
    diff = relativedelta(ends, start)
    print(diff)
So far, I have tried to sort the list by time and then calculate the difference between two states. I would really appreciate it if someone could help with the logic or point me in the right direction.
You can use diff from pandas to get the difference between two rows.
Here is some sample code.
In [1]: import pandas as pd
In [2]: from io import StringIO
In [3]: data = StringIO('''Order,States,modified_at
...: 1,Resolved,2018-06-18T15:05:52.2460000
...: 1,Edited,2018-05-24T21:44:07.9030000
...: 1,Pending PO Creation,2018-06-06T19:52:51.5990000
...: ''')
In [4]: df = pd.read_csv(data, sep=',')
In [5]: df['modified_at'] = pd.to_datetime(df['modified_at']) #convert the type to datetime
In [6]: df
Out[6]:
Order States modified_at
0 1 Resolved 2018-06-18 15:05:52.246
1 1 Edited 2018-05-24 21:44:07.903
2 1 Pending PO Creation 2018-06-06 19:52:51.599
In [7]: df['diff'] = df['modified_at'].diff() #get the diff and add to a new column
In [8]: df
Out[8]:
Order States modified_at diff
0 1 Resolved 2018-06-18 15:05:52.246 NaT
1 1 Edited 2018-05-24 21:44:07.903 -25 days +06:38:15.657000
2 1 Pending PO Creation 2018-06-06 19:52:51.599 12 days 22:08:43.696000
Welcome visal. If your intention is just to check the time difference between timestamps, use to_datetime to convert the column and take the difference by shifting:
index Order States modified_at
0 0 1 Resolved 2018-06-18 15:05:52.246
1 1 1 Edited 2018-05-24 21:44:07.903
2 0 1 Edited 2018-06-06 21:00:19.635
3 1 1 Submitted 2018-05-24 21:44:37.883
4 2 1 Edited 2018-05-30 11:19:36.546
5 3 1 Edited 2018-05-25 11:16:07.969
6 4 1 Edited 2018-05-24 21:43:35.077
7 5 1 Assigned 2018-06-07 18:39:00.258
df.modified_at = pd.to_datetime(df.modified_at)
df['time_spent'] = df.modified_at - df.modified_at.shift()
Out:
0 NaT
1 -25 days +06:38:15.657000
2 12 days 23:16:11.732000
3 -13 days +00:44:18.248000
4 5 days 13:34:58.663000
5 -6 days +23:56:31.423000
6 -1 days +10:27:27.108000
7 13 days 20:55:25.181000
Name: modified_at, dtype: timedelta64[ns]
You can use a pivot table for this requirement:
df.time_spent = df.time_spent.dt.seconds
pd.pivot_table(df,values='time_spent',index=['Order'],columns=['States'],aggfunc=np.sum)
Out:
States Assigned Edited Resolved Submitted
Order
0 NaN 83771.0 0.0 NaN
1 NaN 23895.0 NaN 2658.0
2 NaN 48898.0 NaN NaN
3 NaN 86191.0 NaN NaN
4 NaN 37647.0 NaN NaN
5 75325.0 NaN NaN NaN
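One caveat: dt.seconds only returns the seconds component of each timedelta and ignores whole days, so long gaps get truncated. If the full duration matters, dt.total_seconds() is likely the safer choice; a small sketch of the same pivot under that assumption:
import numpy as np
import pandas as pd
# full duration in seconds, days included
df['time_spent'] = (df.modified_at - df.modified_at.shift()).dt.total_seconds()
pd.pivot_table(df, values='time_spent', index=['Order'], columns=['States'], aggfunc=np.sum)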

Dataframe group by a specific column, aggregate ratio of some other column?

I have a Data Frame with columns: Year and Min Delay. Sample rows as follows:
2014 0
2014 2
2014 0
2014 4
2015 4
2015 4
2015 2
2015 2
I want to group this dataframe by year and find the delay ratio per year (i.e. number of non-zero entries that year divided by total number of entries for that year). So if we consider the data frame above, what I am trying to get is:
2014 0.5
2015 1
(There are 2 delays out of 4 entries in 2014, and 4 delays out of 4 entries in 2015. A delay is defined by Min Delay > 0.)
This is what I tried:
def find_ratio(df):
    ratio = 1 - (len(df[df == 0]) / len(df))
    return ratio
print(df.groupby(["Year"])["Min Delay"].transform(find_ratio).unique())
which prints: [0.5 1]
How can I get a data frame instead of an array?
First, I don't think unique is a good idea here, because it makes it impossible to match the output of the function back to the years.
Also, transform is the right choice only when you need a new column in the DataFrame, not an aggregated DataFrame.
I think you need GroupBy.apply; the function can also be simplified using the mean of a boolean mask:
def find_ratio(df):
    ratio = (df != 0).mean()
    return ratio
print(df.groupby(["Year"])["Min Delay"].apply(find_ratio).reset_index(name='ratio'))
Year ratio
0 2014 0.5
1 2015 1.0
Solution with lambda function:
print (df.groupby(["Year"])["Min Delay"]
.apply(lambda x: (x != 0).mean())
.reset_index(name='ratio'))
Year ratio
0 2014 0.5
1 2015 1.0
Solution with GroupBy.transform, which returns a new column:
df['ratio'] = df.groupby(["Year"])["Min Delay"].transform(find_ratio)
print (df)
Year Min Delay ratio
0 2014 0 0.5
1 2014 2 0.5
2 2014 0 0.5
3 2014 4 0.5
4 2015 4 1.0
5 2015 4 1.0
6 2015 2 1.0
7 2015 2 1.0

Time series: Mean per hour per day per Id number

I am a somewhat beginner programmer learning python (+pandas) and hope I can explain this well enough. I have a large time series pandas dataframe of over 3 million rows and initially 12 columns, spanning a number of years. This covers people taking a ticket from different locations denoted by Id numbers (350 of them). Each row is one instance (one ticket taken).
I have searched many questions like counting records per hour per day and getting average per hour over several years. However, I run into the trouble of including the 'Id' variable.
I'm looking to get the mean value of people taking a ticket for each hour, for each day of the week (mon-fri) and per station.
I have the following, setting datetime to index:
Id Start_date Count Day_name_no
149 2011-12-31 21:30:00 1 5
150 2011-12-31 20:51:00 1 0
259 2011-12-31 20:48:00 1 1
3015 2011-12-31 19:38:00 1 4
28 2011-12-31 19:37:00 1 4
Using groupby and Start_date.index.hour, I can't seem to include the 'Id'.
My alternative approach is to split the hour out of the date and have the following:
Id Count Day_name_no Trip_hour
149 1 2 5
150 1 4 10
153 1 2 15
1867 1 4 11
2387 1 2 7
I then get the count first with:
Count_Item = TestFreq.groupby([TestFreq['Id'], TestFreq['Day_name_no'], TestFreq['Hour']]).count().reset_index()
Id Day_name_no Trip_hour Count
1 0 7 24
1 0 8 48
1 0 9 31
1 0 10 28
1 0 11 26
1 0 12 25
Then use groupby and mean:
Mean_Count = Count_Item.groupby([Count_Item['Id'], Count_Item['Day_name_no'], Count_Item['Hour']]).mean().reset_index()
However, this does not give the desired result as the mean values are incorrect.
I hope I have explained this issue in a clear way. I am looking for the mean per hour per day per Id, as I plan to do clustering to separate my dataset into groups before applying a predictive model on these groups.
Any help would be appreciated, and if possible an explanation of what I am doing wrong, either in my code or in my approach.
Thanks in advance.
I have edited this to try to make it a little clearer. Writing a question on a lack of sleep is probably not advisable.
A toy dataset that I start with:
Date Id Dow Hour Count
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
04/01/2015 1234 1 11 1
I now realise I would have to use the date first and get something like:
Date Id Dow Hour Count
12/12/2014 1234 0 9 5
19/12/2014 1234 0 9 3
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 4
04/01/2015 1234 1 11 1
And then calculate the mean per Id, per Dow, per Hour, to get this:
Id Dow Hour Mean
1234 0 9 4
1234 0 10 1
1234 1 11 2.5
I hope this makes it a bit clearer. My real dataset spans 3 years with 3 million rows and contains 350 Id numbers.
Your question is not very clear, but I hope this helps:
df.reset_index(inplace=True)
# helper columns with date, hour and dow
df['date'] = df['Start_date'].dt.date
df['hour'] = df['Start_date'].dt.hour
df['dow'] = df['Start_date'].dt.dayofweek
# sum of counts for all combinations
df = df.groupby(['Id', 'date', 'dow', 'hour']).sum()
# take the mean over all dates
df = df.reset_index().groupby(['Id', 'dow', 'hour']).mean()
You can also group by the 'Id' column and then use the resample function with a sum aggregation.
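A rough sketch of that idea (the how= keyword is deprecated in recent pandas, so the aggregation is called explicitly; this assumes Start_date is the DatetimeIndex, and note that resampling fills empty hours with zero counts, which changes the mean compared to counting only hours with tickets):
# hourly ticket counts per Id (hours with no tickets become 0)
hourly = df.groupby('Id')['Count'].resample('H').sum().reset_index()
hourly['dow'] = hourly['Start_date'].dt.dayofweek
hourly['hour'] = hourly['Start_date'].dt.hour
# mean hourly count per Id, day of week and hour of day
result = hourly.groupby(['Id', 'dow', 'hour'])['Count'].mean().reset_index(name='mean')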
