I have data about machine failures. The data is in a pandas dataframe with date, id, failure and previous_30_days columns. The previous_30_days column is currently all zeros. My desired outcome is to populate the previous_30_days column with a 1 for rows that fall within a 30-day span before a failure. I am currently able to do this with the following code:
import datetime

failure_df = df[df['failure'] == 1]  # dataframe of just the failure rows
for index, row in failure_df.iterrows():
    df.loc[(df['date'] >= (row.date - datetime.timedelta(days=30))) &
           (df['date'] <= row.date) &
           (df['id'] == row.id), 'previous_30_days'] = 1
Note that I also check for the id match, because dates are repeated in the dataframe, so I cannot assume it is simply the previous 30 rows.
My code works, but the problem is that the dataframe is millions of rows, and this code is too slow at the moment.
Is there a more efficient way to achieve the desired outcome? Any thoughts would be very much appreciated.
I'm a little confused about how your code works (or is supposed to work), but this ought to point you in the right direction and can be easily adapted. It will be much faster because it avoids iterrows in favor of vectorized operations (about 7x faster on this small dataframe; the improvement should be much bigger on your large one).
import datetime
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({'date': np.random.choice(pd.date_range('2015-1-1', periods=300), 20),
                   'id': np.random.randint(1, 4, 20)})
df = df.sort_values(['id', 'date'])  # the old df.sort() has been removed from pandas
Now, calculate days between current and previous date (by id).
df['since_last'] = df.groupby('id')['date'].diff()  # equivalent to x - x.shift() within each id
Then create your new column based on the number of days to the previous date.
df['previous_30_days'] = df['since_last'] < datetime.timedelta(days=30)
date id since_last previous_30_days
12 2015-02-17 1 NaT False
6 2015-02-27 1 10 days True
3 2015-03-25 1 26 days True
0 2015-04-09 1 15 days True
10 2015-04-24 1 15 days True
5 2015-05-04 1 10 days True
11 2015-05-07 1 3 days True
8 2015-08-14 1 99 days False
14 2015-02-02 2 NaT False
9 2015-04-07 2 64 days False
19 2015-07-28 2 112 days False
7 2015-08-03 2 6 days True
15 2015-08-13 2 10 days True
1 2015-08-19 2 6 days True
2 2015-01-18 3 NaT False
13 2015-03-15 3 56 days False
18 2015-04-07 3 23 days True
4 2015-04-17 3 10 days True
16 2015-04-22 3 5 days True
17 2015-09-11 3 142 days False
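For the exact 1/0 flag asked about in the question (rows within 30 days before a failure of the same id), one way to adapt this without iterrows is pd.merge_asof, looking forward from each row to the next failure of the same id within a 30-day tolerance. This is only a sketch using the question's column names and assuming 'date' is a datetime column; it has not been tested against the real data:
# Failure dates per id, sorted by date as merge_asof requires.
failures = (df.loc[df['failure'] == 1, ['id', 'date']]
              .rename(columns={'date': 'failure_date'})
              .sort_values('failure_date'))
df = df.sort_values('date')
merged = pd.merge_asof(df, failures,
                       left_on='date', right_on='failure_date', by='id',
                       direction='forward', tolerance=pd.Timedelta(days=30))
# merge_asof returns rows in the left frame's order, so assign by position.
df['previous_30_days'] = merged['failure_date'].notna().to_numpy().astype(int)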
I have a column with object ('O') data type. It has numbers as well as strings. For example:
Days
5
10
15
7
No Sales Data available
9
I am trying to create a separate column using np.where, and have written:
np.where(df['Days'] == 'No Sales Data available', 'No Sales',
         np.where(df['Days'] <= 10, 'Less than 10 days Sales', 'More than 10 Days Sales'))
Naturally, the code is giving problems due to mixed data types. Any idea how to get around such cases?
You could rewrite your statement this way, which preserves the data type of your 'Days' column:
days_num = pd.to_numeric(df['Days'], errors='coerce')
df['new'] = np.where(days_num.isna(), 'No Sale',
                     np.where(days_num <= 10,
                              'Less than 10 days Sales', 'More than 10 Days Sales'))
print(df)
Days new
0 5 Less than 10 days Sales
1 10 Less than 10 days Sales
2 15 More than 10 Days Sales
3 7 Less than 10 days Sales
4 No Sales Data available No Sale
5 9 Less than 10 days Sales
If you don't mind changing the type of your column, you could first convert to numeric and follow a similar logic:
df['Days'] = pd.to_numeric(df['Days'], errors='coerce')
df['new'] = np.where(df['Days'].isna(), 'No Sale',
                     np.where(df['Days'] <= 10, 'Less than 10 days Sales', 'More than 10 Days Sales'))
print(df)
Days new
0 5.0 Less than 10 days Sales
1 10.0 Less than 10 days Sales
2 15.0 More than 10 Days Sales
3 7.0 Less than 10 days Sales
4 NaN No Sale
5 9.0 Less than 10 days Sales
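If more categories are ever needed, np.select can express the same logic without nesting. A small sketch along the lines of the answer above, working on the original mixed-type column (not part of the original answer):
# Convert once, then map conditions to labels; anything left over gets the default.
days_num = pd.to_numeric(df['Days'], errors='coerce')
conditions = [days_num.isna(), days_num <= 10]
choices = ['No Sale', 'Less than 10 days Sales']
df['new'] = np.select(conditions, choices, default='More than 10 Days Sales')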
I have a dataframe like the following:
id_cliente id_ordine data_ordine id_medium
0 madinside IML-0042758 2016-08-23 1190408
1 lisbeth19 IML-0071225 2017-02-26 1205650
2 lisbeth19 IML-0072944 2017-03-15 1207056
3 lisbeth19 IML-0077676 2017-05-12 1211395
4 lisbeth19 IML-0077676 2017-05-12 1207056
5 madinside IML-0094979 2017-09-29 1222195
6 lisbeth19 IML-0099675 2017-11-15 1211446
7 lisbeth19 IML-0099690 2017-11-15 1225212
8 lisbeth19 IML-0101439 2017-12-02 1226511
9 lisbeth19 IML-0109883 2018-03-14 1226511
I would like to add three columns:
the first column could be named "number of order per client" and should be the progression of orders made by the same client.
So order IML-0042758 should be 1, IML-0071225 should be 1, IML-0072944 should be 2, IML-0077676 should be 3, IML-0094979 should be 2, and so on..
the second column could be named "days between first and n order of the same client" and shows the "data_ordine" difference (a datetime column) between the different orders made by the same client.
So the values for the first 6 rows would be: 0 (2016-08-23 - 2016-08-23), 0 (2017-02-26 - 2017-02-26), 17 (2017-03-15 - 2017-02-26), 75 (2017-05-12 - 2017-02-26), 75 (2017-05-12 - 2017-02-26), 402 (2017-09-29 - 2017-02-26).
the third column could be named "days between first and n order of the same id_medium" and shows the "data_ordine" difference (a datetime column) between the different orders per id_medium.
So the values for the first 6 rows would be: 0 (2016-08-23 - 2016-08-23), 0 (2017-02-26 - 2017-02-26), 0 (2017-03-15 - 2017-03-15), 0 (2017-05-12 - 2017-05-12), 58 (2017-05-12 - 2017-03-15 because the medium "1207056" is ordered for the second time), 0 (2017-09-29 - 2017-09-29).
In the end I would like to calculate how long it takes on average for a client to make a second order, a third order, a fourth order and so on.
And how long it takes on average for a client to make a second, third (etc.) order for the same id_medium.
First convert to datetime and sort so the calculations are reliable.
For the first column we can use groupby + ngroup to label each order, then subtract each client's minimum so the numbering starts from 1.
For days from the first order, use groupby + transform to get each client's first date, then subtract.
The third column is the same, just with id_medium added to the grouping.
Code:
df['data_ordine'] = pd.to_datetime(df['data_ordine'])
df = df.sort_values('data_ordine')
df['Num_ords'] = df.groupby(['id_cliente', 'id_ordine']).ngroup()
df['Num_ords'] = df['Num_ords'] - df.groupby('id_cliente')['Num_ords'].transform('min') + 1
df['days_bet'] = (df['data_ordine'] - df.groupby('id_cliente')['data_ordine'].transform('min')).dt.days
df['days_bet_id'] = (df['data_ordine'] - df.groupby(['id_cliente', 'id_medium'])['data_ordine'].transform('min')).dt.days
Output:
id_cliente id_ordine data_ordine id_medium Num_ords days_bet days_bet_id
0 madinside IML-0042758 2016-08-23 1190408 1 0 0
1 lisbeth19 IML-0071225 2017-02-26 1205650 1 0 0
2 lisbeth19 IML-0072944 2017-03-15 1207056 2 17 0
3 lisbeth19 IML-0077676 2017-05-12 1211395 3 75 0
4 lisbeth19 IML-0077676 2017-05-12 1207056 3 75 58
5 madinside IML-0094979 2017-09-29 1222195 2 402 0
6 lisbeth19 IML-0099675 2017-11-15 1211446 4 262 0
7 lisbeth19 IML-0099690 2017-11-15 1225212 5 262 0
8 lisbeth19 IML-0101439 2017-12-02 1226511 6 279 0
9 lisbeth19 IML-0109883 2018-03-14 1226511 7 381 102
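To get the averages asked about at the end of the question, a possible follow-up (a sketch building on the columns above, not part of the answer's output; names like avg_days_to_nth_order are just illustrative) is to group days_bet by the order number, and to number repeat orders of each id_medium before doing the same per medium:
# Average days from a client's first order to their n-th order (n = Num_ords).
# Rows, not distinct orders, are averaged; drop duplicates on id_ordine first if needed.
avg_days_to_nth_order = df.groupby('Num_ords')['days_bet'].mean()
# Number repeat orders of the same id_medium per client (df is already sorted by
# data_ordine above), then average the day differences the same way.
df['medium_ord'] = df.groupby(['id_cliente', 'id_medium']).cumcount() + 1
avg_days_to_nth_medium_order = df.groupby('medium_ord')['days_bet_id'].mean()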
I have a dataframe that goes like this
id rev committer_id
date
1996-07-03 08:18:15 1 76620 1
1996-07-03 08:18:15 2 76621 2
1996-11-18 20:51:08 3 76987 3
1996-11-21 09:12:53 4 76995 2
1996-11-21 09:16:33 5 76997 2
1996-11-21 09:39:27 6 76999 2
1996-11-21 09:53:37 7 77003 2
1996-11-21 10:11:35 8 77006 2
1996-11-21 10:17:50 9 77008 2
1996-11-21 10:23:58 10 77010 2
1996-11-21 10:32:58 11 77012 2
1996-11-21 10:55:51 12 77014 2
I would like to group by quarterly periods and then show the number of unique entries in the committer_id column. Columns id and rev are not actually used for the moment.
I would like to have a result as the following
committer_id
date
1996-09-30 2
1996-12-31 91
1997-03-31 56
1997-06-30 154
1997-09-30 84
The actual results are aggregated by the number of entries in each time period and not by unique entries. I am using the following:
df[['committer_id']].groupby(pd.Grouper(freq='Q-DEC')).aggregate(np.size)
Can't figure out how to use np.unique.
Any ideas, please.
df[['committer_id']].groupby(pd.Grouper(freq='Q-DEC')).aggregate(pd.Series.nunique)
Should work for you. Or df.groupby(pd.Grouper(freq='Q-DEC'))['committer_id'].nunique()
Your try with np.unique didn't work because it returns an array of unique items, while the result of agg must be a scalar. So .aggregate(lambda x: len(np.unique(x))) would probably work too.
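As a usage note (assuming the date column is the DatetimeIndex, as in the printout above), resample is an equivalent spelling:
df['committer_id'].resample('Q-DEC').nunique()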
I am a somewhat beginner programmer learning Python (+pandas) and hope I can explain this well enough. I have a large time-series pandas dataframe of over 3 million rows and initially 12 columns, spanning a number of years. It covers people taking a ticket from different locations denoted by Id numbers (350 of them). Each row is one instance (one ticket taken).
I have searched many questions like counting records per hour per day and getting average per hour over several years. However, I run into the trouble of including the 'Id' variable.
I'm looking to get the mean value of people taking a ticket for each hour, for each day of the week (mon-fri) and per station.
I have the following, setting datetime to index:
Id Start_date Count Day_name_no
149 2011-12-31 21:30:00 1 5
150 2011-12-31 20:51:00 1 0
259 2011-12-31 20:48:00 1 1
3015 2011-12-31 19:38:00 1 4
28 2011-12-31 19:37:00 1 4
Using groupby and Start_date.index.hour, I can't seem to include the 'Id'.
My alternative approach is to split the hour out of the date and have the following:
Id Count Day_name_no Trip_hour
149 1 2 5
150 1 4 10
153 1 2 15
1867 1 4 11
2387 1 2 7
I then get the count first with:
Count_Item = TestFreq.groupby([TestFreq['Id'], TestFreq['Day_name_no'], TestFreq['Hour']]).count().reset_index()
Id Day_name_no Trip_hour Count
1 0 7 24
1 0 8 48
1 0 9 31
1 0 10 28
1 0 11 26
1 0 12 25
Then use groupby and mean:
Mean_Count = Count_Item.groupby([Count_Item['Id'], Count_Item['Day_name_no'], Count_Item['Hour']]).mean().reset_index()
However, this does not give the desired result as the mean values are incorrect.
I hope I have explained this issue clearly. I am looking for the mean per hour, per day, per Id, as I plan to do clustering to separate my dataset into groups before applying a predictive model to them.
Any help would be much appreciated, ideally with an explanation of what I am doing wrong, either code-wise or in my approach.
Thanks in advance.
I have edited this to try to make it a little clearer. Writing a question with a lack of sleep is probably not advisable.
A toy dataset that I start with:
Date Id Dow Hour Count
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
04/01/2015 1234 1 11 1
I now realise I would have to use the date first and get something like:
Date Id Dow Hour Count
12/12/2014 1234 0 9 5
19/12/2014 1234 0 9 3
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 4
04/01/2015 1234 1 11 1
And then calculate the mean per Id, per Dow, per Hour, to get this:
Id Dow Hour Mean
1234 0 9 4
1234 0 10 1
1234 1 11 2.5
I hope this makes it a bit clearer. My real dataset spans 3 years, has 3 million rows and contains 350 Id numbers.
Your question is not very clear, but I hope this helps:
df.reset_index(inplace=True)
# helper columns with date, hour and dow
df['date'] = df['Start_date'].dt.date
df['hour'] = df['Start_date'].dt.hour
df['dow'] = df['Start_date'].dt.dayofweek
# sum of counts for all combinations
df = df.groupby(['Id', 'date', 'dow', 'hour']).sum()
# take the mean over all dates
df = df.reset_index().groupby(['Id', 'dow', 'hour']).mean()
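As a check against the toy data from the edit above (column names taken from that edit; a sketch, not the answer's own code), the same two-step groupby reproduces the desired means of 4, 1 and 2.5:
import pandas as pd

toy = pd.DataFrame({
    'Date': ['12/12/2014'] * 5 + ['19/12/2014'] * 3 + ['26/12/2014']
            + ['27/12/2014'] * 4 + ['04/01/2015'],
    'Id': 1234,
    'Dow': [0] * 9 + [1] * 5,
    'Hour': [9] * 8 + [10] + [11] * 5,
    'Count': 1,
})
# Sum of counts per Id, date, day-of-week and hour ...
daily = toy.groupby(['Id', 'Date', 'Dow', 'Hour'], as_index=False)['Count'].sum()
# ... then the mean over dates per (Id, Dow, Hour) slot.
mean_per_slot = daily.groupby(['Id', 'Dow', 'Hour'])['Count'].mean()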
You can use the groupby function on the 'Id' column and then use the resample function followed by .sum() (the old how='sum' argument has been removed from recent pandas).
I have a Python dataframe with 1408 lines of data. My goal is to compare the largest and smallest numbers associated with a given weekday during one week to the next week's numbers on the same day of the week on which the prior largest/smallest occurred. Essentially, I want to look at quintiles (since there are 5 days in a business week), ranks 1 and 5, and see how they change from week to week, and build a CDF of the numbers associated with each weekday.
To clean the data, I need to remove 18 weeks in total: every week in the dataframe associated with a holiday, plus the entire week following the week in which the holiday occurred.
After this, I think I should insert a column in the dataframe that labels all my data with Monday through Friday, for all the dates in the file (there are 6 years of data). The reason for labeling M-F is so that I can sort the numbers associated with each day of the week in ascending order, and query on the day of the week.
Methodological suggestions on either 1. or 2. or both would be immensely appreciated.
Thank you!
#2 seems like it's best tackled with a combination of df.groupby() and apply() on the resulting Groupby object. Perhaps an example is the best way to explain.
Given a dataframe:
In [53]: df
Out[53]:
Value
2012-08-01 61
2012-08-02 52
2012-08-03 89
2012-08-06 44
2012-08-07 35
2012-08-08 98
2012-08-09 64
2012-08-10 48
2012-08-13 100
2012-08-14 95
2012-08-15 14
2012-08-16 55
2012-08-17 58
2012-08-20 11
2012-08-21 28
2012-08-22 95
2012-08-23 18
2012-08-24 81
2012-08-27 27
2012-08-28 81
2012-08-29 28
2012-08-30 16
2012-08-31 50
In [54]: def rankdays(df):
   .....:     if len(df) != 5:
   .....:         return pandas.Series(dtype=float)
   .....:     return pandas.Series(df.Value.rank().values, index=df.index.weekday)
   .....:
In [52]: df.groupby(lambda x: x.week).apply(rankdays).unstack()
Out[52]:
0 1 2 3 4
32 2 1 5 4 3
33 5 4 1 2 3
34 1 3 5 2 4
35 2 5 3 1 4
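For the weekday labeling in #2, a minimal sketch (assuming a DatetimeIndex like the example above and a reasonably recent pandas; not part of the original answer):
df['weekday'] = df.index.day_name()   # 'Monday' ... 'Friday'
df['weekday_num'] = df.index.weekday  # 0 = Monday ... 4 = Friday, handy for sorting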