group by pandas dataframe and condition - python

My question is based on this thread, where we group values of a pandas dataframe and select the latest (by date) from each group:
id product date
0 220 6647 2014-09-01
1 220 6647 2014-09-03
2 220 6647 2014-10-16
3 826 3380 2014-11-11
4 826 3380 2014-12-09
5 826 3380 2015-05-19
6 901 4555 2014-09-01
7 901 4555 2014-10-05
8 901 4555 2014-11-01
using the following
df.loc[df.groupby('id').date.idxmax()]
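For reference, a minimal setup that reproduces the table above (values copied from the question; 'date' is parsed to datetime so idxmax compares actual dates rather than strings):
import pandas as pd

df = pd.DataFrame({
    'id':      [220, 220, 220, 826, 826, 826, 901, 901, 901],
    'product': [6647, 6647, 6647, 3380, 3380, 3380, 4555, 4555, 4555],
    'date':    pd.to_datetime(['2014-09-01', '2014-09-03', '2014-10-16',
                               '2014-11-11', '2014-12-09', '2015-05-19',
                               '2014-09-01', '2014-10-05', '2014-11-01']),
})
# latest date per id, as in the linked thread
print(df.loc[df.groupby('id').date.idxmax()])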
Say, however, that I want to add the condition that I only select the latest (by date) within each group of records that fall within +/- 5 days of each other. I.e., after grouping I want to find the latest within the following groups:
0 220 6647 2014-09-01 #because only these two are within +/- 5 days of each other
1 220 6647 2014-09-03
2 220 6647 2014-10-16 #spaced more than 5 days apart from the two records above
3 826 3380 2014-11-11
.....
which yields
id product date
1 220 6647 2014-09-03
2 220 6647 2014-10-16
3 826 3380 2014-11-11
4 826 3380 2014-12-09
5 826 3380 2015-05-19
6 901 4555 2014-09-01
7 901 4555 2014-10-05
8 901 4555 2014-11-01
Dataset with price:
id product date price
0 220 6647 2014-09-01 100 #group 1
1 220 6647 2014-09-03 120 #group 1 --> pick this
2 220 6647 2014-09-05 0 #group 1
3 826 3380 2014-11-11 150 #group 2 --> pick this
4 826 3380 2014-12-09 23 #group 3 --> pick this
5 826 3380 2015-05-12 88 #group 4 --> pick this
6 901 4555 2015-05-15 32 #group 4
7 901 4555 2015-10-05 542 #group 5 --> pick this
8 901 4555 2015-11-01 98 #group 6 --> pick this

I think you need to create the groups with apply and a list comprehension using between, then convert them to numeric group ids with factorize, and finally use your solution with loc + idxmax:
df['date'] = pd.to_datetime(df['date'])
df = df.reset_index(drop=True)
td = pd.Timedelta('5 days')
def f(x):
    x['g'] = [tuple(x.index[x['date'].between(i - td, i + td)]) for i in x['date']]
    return x
df2 = df.groupby('id').apply(f)
df2['g'] = pd.factorize(df2['g'])[0]
print (df2)
id product date price g
0 220 6647 2014-09-01 100 0
1 220 6647 2014-09-03 120 0
2 220 6647 2014-09-05 0 0
3 826 3380 2014-11-11 150 1
4 826 3380 2014-12-09 23 2
5 826 3380 2015-05-12 88 3
6 901 4555 2015-05-15 32 4
7 901 4555 2015-10-05 542 5
8 901 4555 2015-11-01 98 6
df3 = df2.loc[df2.groupby('g')['price'].idxmax()]
print (df3)
id product date price g
1 220 6647 2014-09-03 120 0
3 826 3380 2014-11-11 150 1
4 826 3380 2014-12-09 23 2
5 826 3380 2015-05-12 88 3
6 901 4555 2015-05-15 32 4
7 901 4555 2015-10-05 542 5
8 901 4555 2015-11-01 98 6
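As a side note, if you want the original requirement (the latest date within each +/- 5 day group) rather than the highest price, the same g column can be reused; this adaptation is only a sketch, not part of the answer above:
# Sketch: reuse the numeric groups in 'g', but pick the latest date per group
# instead of the highest price.
latest = df2.loc[df2.groupby('g')['date'].idxmax()]
print(latest)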

Or use a two-liner:
df2=pd.to_numeric(df.groupby('id')['date'].diff(-1).astype(str).str[:-25]).abs().fillna(6)
print(df.loc[df2.index[df2>5].tolist()])
Output:
id product date
1 220 6647 2014-09-03
2 220 6647 2014-10-16
3 826 3380 2014-11-11
4 826 3380 2014-12-09
5 826 3380 2015-05-19
6 901 4555 2014-09-01
7 901 4555 2014-10-05
8 901 4555 2014-11-01
So: take the per-group diff, slice the day count out of its string representation, take the absolute value (the NaN for the last row of each group is filled with 6 so it is always kept), keep only the rows whose gap is greater than 5 days, and use those indexes to select the matching rows from df.
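If you prefer to avoid the string slicing, a possible cleaner variant (same idea, assuming 'date' is already datetime; 'gap' is just an illustrative name) uses the timedelta accessor to get whole days directly:
# Same logic as the two-liner above, but .dt.days extracts the day gap
# numerically instead of slicing the string representation.
gap = df.groupby('id')['date'].diff(-1).dt.days.abs().fillna(6)
print(df.loc[gap[gap > 5].index])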

Related

Unable to filter out a specific value for a dataset

I want to filter out a specific value (9999) that appears many times in a subset of my dataset. This is what I have done so far, but I'm not sure how to filter out all the 9999 values.
import pandas as pd
import statistics
df=pd.read_csv('Area(2).txt',delimiter='\t')
Initially, this is what part of my dataset for 30 days (containing 600+ values) looks like. I'm just showing the first two rows here.
No Date Time Rand Col Value
0 2161 1 4 1991 0:00 181 1 9999
1 2162 1 4 1991 1:00 181 2 9999
Now I wanted to select the range of numbers under the column "Value" between 23-25 April. So I did the following:
df5=df.iloc[528:602,5]
print(df5)
The range of values I get for 23-25 April looks like this:
528 9999
529 9999
530 9999
531 9999
532 9999
597 9999
598 9999
599 9999
600 9999
601 9999
Name: Value, Length: 74, dtype: int64
I want to filter out all 9999 values from this subset. I have tried a number of ways to get rid of them, but I keep getting IndexError: positional indexers are out-of-bounds, so I am unable to drop the 9999 values and do further work like finding the variance and standard deviation of the selected subset.
If this helps, I also tried to filter out 9999 in the beginning and it looked like this:
df2=df[df.Value!=9999]
print(df2)
No Date Time Rand Col Value
6 2167 1 4 1991 6:00 181 7 152
7 2168 1 4 1991 7:00 181 8 178
8 2169 1 4 1991 8:00 181 9 239
9 2170 1 4 1991 9:00 181 10 296
10 2171 1 4 1991 10:00 181 11 337
.. ... ... ... ... ... ...
638 2799 27 4 1991 14:00 234 3 193
639 2800 27 4 1991 15:00 234 4 162
640 2801 27 4 1991 16:00 234 5 144
641 2802 27 4 1991 17:00 234 6 151
642 2803 27 4 1991 18:00 234 7 210
[351 rows x 6 columns]
Then I tried to obtain the range of column values between 23 April and 25 April as shown below, but I always get IndexError: positional indexers are out-of-bounds:
df6=df2.iloc[528:602,5]
print(df6)
How can I properly filter out the value I mentioned and obtain the subset of the dataset that I need?
Given:
No Date Time Rand Col Value
0 2161 1 4 1991 0:00 181 1 9999
1 2162 1 4 1991 1:00 181 2 9999
2 2167 1 4 1991 6:00 181 7 152
3 2168 1 4 1991 7:00 181 8 178
4 2169 1 4 1991 8:00 181 9 239
5 2170 1 4 1991 9:00 181 10 296
6 2171 1 4 1991 10:00 181 11 337
7 2799 27 4 1991 14:00 234 3 193
8 2800 27 4 1991 15:00 234 4 162
9 2801 27 4 1991 16:00 234 5 144
10 2802 27 4 1991 17:00 234 6 151
11 2803 27 4 1991 18:00 234 7 210
The IndexError happens because iloc is positional: after filtering, df2 only has 351 rows, so positions 528:602 no longer exist. Selecting by date instead avoids the problem. First, let's make a proper datetime index:
# Your dates are pretty scuffed; some reformatting was needed to make sense of them...
df.index = pd.to_datetime(df.Date.str.split().apply(lambda x: f'{x[1].zfill(2)}-{x[0].zfill(2)}-{x[2]}') + ' ' + df.Time)
df.drop(['Date', 'Time'], axis=1, inplace=True)
This gives:
No Rand Col Value
1991-04-01 00:00:00 2161 181 1 9999
1991-04-01 01:00:00 2162 181 2 9999
1991-04-01 06:00:00 2167 181 7 152
1991-04-01 07:00:00 2168 181 8 178
1991-04-01 08:00:00 2169 181 9 239
1991-04-01 09:00:00 2170 181 10 296
1991-04-01 10:00:00 2171 181 11 337
1991-04-27 14:00:00 2799 234 3 193
1991-04-27 15:00:00 2800 234 4 162
1991-04-27 16:00:00 2801 234 5 144
1991-04-27 17:00:00 2802 234 6 151
1991-04-27 18:00:00 2803 234 7 210
Then, we can easily fulfill your conditions (replace the dates with your own desired range).
df[df.Value.ne(9999)].loc['1991-04-01':'1991-04-01']
# df[df.Value.ne(9999)].loc['1991-04-23':'1991-04-25']
Output:
No Rand Col Value
1991-04-01 06:00:00 2167 181 7 152
1991-04-01 07:00:00 2168 181 8 178
1991-04-01 08:00:00 2169 181 9 239
1991-04-01 09:00:00 2170 181 10 296
1991-04-01 10:00:00 2171 181 11 337
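As a possible follow-up (not part of the answer above), the statistics the question asks about can then be computed on a date-based slice instead of positional iloc indexes:
# Hypothetical continuation: variance and standard deviation for 23-25 April
# once the 9999 sentinel rows are filtered out.
subset = df[df.Value.ne(9999)].loc['1991-04-23':'1991-04-25', 'Value']
print(subset.var())  # sample variance
print(subset.std())  # sample standard deviation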

group by pandas dataframe and select next upcoming date in each group

Same question as here: group by pandas dataframe and select latest in each group, except that instead of the latest date, I would like to get the next upcoming date for each group.
So given a dataframe sorted by date:
id product date
0 220 6647 2020-09-01
1 220 6647 2020-10-03
2 220 6647 2020-12-16
3 826 3380 2020-11-11
4 826 3380 2020-12-09
5 826 3380 2021-05-19
6 901 4555 2020-09-01
7 901 4555 2020-12-01
8 901 4555 2021-11-01
Using today's date (2020-12-01) to determine the next upcoming date, grouping by id or product and selecting the next upcoming date should give:
id product date
2 220 6647 2020-12-16
4 826 3380 2020-12-09
8 901 4555 2021-11-01
Filter the dates first, then drop duplicates:
df[df['date']>'2020-12-01'].sort_values(['id','date']).drop_duplicates('id')
Output:
id product date
2 220 6647 2020-12-16
4 826 3380 2020-12-09
8 901 4555 2021-11-01
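An equivalent sketch that mirrors the idxmax pattern from the linked question (assuming 'date' is a datetime column, or ISO-formatted strings that sort chronologically):
# Keep only future dates, then take the earliest remaining date per id.
upcoming = df[df['date'] > '2020-12-01']
print(upcoming.loc[upcoming.groupby('id')['date'].idxmin()])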

How to keep the first/last group in a group

How to keep the last group within a group using Pandas?
For example, given the following dataset:
id product date
0 220 6647 2014-09-01
1 220 6647 2014-09-03
2 220 6647 2014-10-16
3 826 6647 2014-11-11
4 826 6647 2014-12-09
5 826 6647 2015-05-19
6 901 4555 2014-09-01
7 901 4555 2014-10-05
8 901 4555 2014-11-01
9 401 4555 2015-01-05
10 401 4555 2015-02-01
how do I succinctly get the last id group from each product group?
id product date
3 826 6647 2014-11-11
4 826 6647 2014-12-09
5 826 6647 2015-05-19
9 401 4555 2015-01-05
10 401 4555 2015-02-01
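One possible sketch (no answer appears in this thread, so this is only a suggestion): for each product, find the id that occurs last, then keep the rows belonging to that id. Swapping 'last' for 'first' keeps the first id group instead.
# For every row, transform('last') yields the last id seen within that row's
# product group; comparing against it keeps only that last id group.
last_ids = df.groupby('product')['id'].transform('last')
print(df[df['id'] == last_ids])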

Parsing week of year to datetime objects with pandas

A B C D yearweek
0 245 95 60 30 2014-48
1 245 15 70 25 2014-49
2 150 275 385 175 2014-50
3 100 260 170 335 2014-51
4 580 925 535 2590 2015-02
5 630 126 485 2115 2015-03
6 425 90 905 1085 2015-04
7 210 670 655 945 2015-05
The last column contains the year along with the week number. Is it possible to convert this to a datetime column with pd.to_datetime?
I've tried:
pd.to_datetime(df.yearweek, format='%Y-%U')
0 2014-01-01
1 2014-01-01
2 2014-01-01
3 2014-01-01
4 2015-01-01
5 2015-01-01
6 2015-01-01
7 2015-01-01
Name: yearweek, dtype: datetime64[ns]
But that output is incorrect, even though I believe %U is the format string for the week number. What am I missing here?
You need another parameter to specify the day of the week - without a weekday directive, strptime ignores the week number in the calculation, which is why every row collapsed to the start of the year. Check this:
df = pd.to_datetime(df.yearweek.add('-0'), format='%Y-%W-%w')
print (df)
0 2014-12-07
1 2014-12-14
2 2014-12-21
3 2014-12-28
4 2015-01-18
5 2015-01-25
6 2015-02-01
7 2015-02-08
Name: yearweek, dtype: datetime64[ns]
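A quick check on a single value, same idea as above: appending a weekday digit gives strptime enough information to resolve the week number to an actual date.
import pandas as pd

# '-0' appends Sunday as the weekday; the result matches row 0 above.
print(pd.to_datetime('2014-48' + '-0', format='%Y-%W-%w'))
# Timestamp('2014-12-07 00:00:00')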

Pandas if/then aggregation

I've been searching SO and haven't figured this out yet. Hoping someone can aid this Python newb in solving my problem.
I'm trying to figure out how to write an if/then statement in Python and perform an aggregation off of it. My end goal: if the date is 1/7/2017, use the value in the "fake" column; for all other dates, average the two columns together.
Here is what I have so far:
import pandas as pd
import numpy as np
import datetime
np.random.seed(42)
dte=pd.date_range(start=datetime.date(2017,1,1), end= datetime.date(2017,1,15))
fake=np.random.randint(15,100, size=15)
fake2=np.random.randint(300,1000,size=15)
so_df = pd.DataFrame({'date': dte,
                      'fake': fake,
                      'fake2': fake2})
so_df['avg']= so_df[['fake','fake2']].mean(axis=1)
so_df.head()
Assuming you have already computed the average column:
so_df['fake'].where(so_df['date']=='20170107', so_df['avg'])
Out:
0 375.5
1 260.0
2 331.0
3 267.5
4 397.0
5 355.0
6 89.0
7 320.5
8 449.0
9 395.5
10 197.0
11 438.5
12 498.5
13 409.5
14 525.5
Name: fake, dtype: float64
If not, you can replace the column reference with the same calculation:
so_df['fake'].where(so_df['date']=='20170107', so_df[['fake','fake2']].mean(axis=1))
To check for multiple dates, you need to use the element-wise version of the or operator (which is pipe: |). Otherwise it will raise an error.
so_df['fake'].where((so_df['date']=='20170107') | (so_df['date']=='20170109'), so_df['avg'])
The above checks for two dates. In the case of 3 or more, you may want to use isin with a list:
so_df['fake'].where(so_df['date'].isin(['20170107', '20170109', '20170112']), so_df['avg'])
Out[42]:
0 375.5
1 260.0
2 331.0
3 267.5
4 397.0
5 355.0
6 89.0
7 320.5
8 38.0
9 395.5
10 197.0
11 67.0
12 498.5
13 409.5
14 525.5
Name: fake, dtype: float64
Let's use np.where:
so_df['avg'] = np.where(so_df['date'] == pd.to_datetime('2017-01-07'),
                        so_df['fake'],
                        so_df[['fake', 'fake2']].mean(1))
Output:
date fake fake2 avg
0 2017-01-01 66 685 375.5
1 2017-01-02 29 491 260.0
2 2017-01-03 86 576 331.0
3 2017-01-04 75 460 267.5
4 2017-01-05 35 759 397.0
5 2017-01-06 97 613 355.0
6 2017-01-07 89 321 89.0
7 2017-01-08 89 552 320.5
8 2017-01-09 38 860 449.0
9 2017-01-10 17 774 395.5
10 2017-01-11 36 358 197.0
11 2017-01-12 67 810 438.5
12 2017-01-13 16 981 498.5
13 2017-01-14 44 775 409.5
14 2017-01-15 52 999 525.5
One way to do if-else in pandas is with np.where. It takes three arguments: the condition, the value to use where it is true, and the value to use where it is false.
so_df['avg']= np.where(so_df['date'] == '2017-01-07',so_df['fake'],so_df[['fake','fake2']].mean(axis=1))
date fake fake2 avg
0 2017-01-01 66 685 375.5
1 2017-01-02 29 491 260.0
2 2017-01-03 86 576 331.0
3 2017-01-04 75 460 267.5
4 2017-01-05 35 759 397.0
5 2017-01-06 97 613 355.0
6 2017-01-07 89 321 89.0
7 2017-01-08 89 552 320.5
8 2017-01-09 38 860 449.0
9 2017-01-10 17 774 395.5
10 2017-01-11 36 358 197.0
11 2017-01-12 67 810 438.5
12 2017-01-13 16 981 498.5
13 2017-01-14 44 775 409.5
14 2017-01-15 52 999 525.5
We can also use the Series.where() method:
In [141]: so_df['avg'] = so_df['fake'] \
     ...:     .where(so_df['date'].isin(['2017-01-07','2017-01-09'])) \
     ...:     .fillna(so_df[['fake','fake2']].mean(1))
     ...:
In [142]: so_df
Out[142]:
date fake fake2 avg
0 2017-01-01 66 685 375.5
1 2017-01-02 29 491 260.0
2 2017-01-03 86 576 331.0
3 2017-01-04 75 460 267.5
4 2017-01-05 35 759 397.0
5 2017-01-06 97 613 355.0
6 2017-01-07 89 321 89.0
7 2017-01-08 89 552 320.5
8 2017-01-09 38 860 38.0
9 2017-01-10 17 774 395.5
10 2017-01-11 36 358 197.0
11 2017-01-12 67 810 438.5
12 2017-01-13 16 981 498.5
13 2017-01-14 44 775 409.5
14 2017-01-15 52 999 525.5