Performing operations on group by based on column value Pandas - python

I have a grouped pandas dataframe
x y id date qty
6 3 932 2017-05-14 212
6 3 932 2017-05-15 212
6 3 932 2017-05-18 212
6 3 933 2016-10-03 518
6 3 933 2016-10-09 16
6 3 933 2016-10-15 28
I want to know how to get the number of days between orders for a particular id. The first date should be day 0, and each subsequent value the number of days since the previous order for that id. Something like this:
x y id date qty
6 3 932 0 212
6 3 932 1 212
6 3 932 3 212
6 3 933 0 518
6 3 933 6 16
6 3 933 6 28

You can groupby by id and get the diff, replace NaT with fillna, and finally get the days:
print (df)
x y id date qty
0 6 3 932 2017-05-14 212
1 6 3 932 2017-05-15 212
2 6 3 932 2017-05-18 212
3 6 3 933 2016-10-03 518
4 6 3 933 2016-10-09 16
5 6 3 933 2016-10-15 28
# if necessary, convert to datetime
df['date'] = pd.to_datetime(df['date'])
df['date'] = df.groupby(['id'])['date'].diff().fillna(0).dt.days
print (df)
x y id date qty
0 6 3 932 0 212
1 6 3 932 1 212
2 6 3 932 3 212
3 6 3 933 0 518
4 6 3 933 6 16
5 6 3 933 6 28
And Zero's solution is very similar; only the output is float rather than int, because of the ordering of the functions (fillna is applied after .dt.days, so the NaN from the first row of each group forces a float dtype).

Use diff() on the date column within each id group, then use the dt accessor to get the days with dt.days, and fill the NaNs with 0:
In [772]: df.groupby('id')['date'].diff().dt.days.fillna(0)
Out[772]:
0 0.0
1 1.0
2 3.0
3 0.0
4 6.0
5 6.0
Name: date, dtype: float64
In [773]: df['date'] = df.groupby('id')['date'].diff().dt.days.fillna(0)
In [774]: df
Out[774]:
x y id date qty
0 6 3 932 0.0 212
1 6 3 932 1.0 212
2 6 3 932 3.0 212
3 6 3 933 0.0 518
4 6 3 933 6.0 16
5 6 3 933 6.0 28
Details
Original df
In [776]: df
Out[776]:
x y id date qty
0 6 3 932 2017-05-14 212
1 6 3 932 2017-05-15 212
2 6 3 932 2017-05-18 212
3 6 3 933 2016-10-03 518
4 6 3 933 2016-10-09 16
5 6 3 933 2016-10-15 28
In [778]: df.dtypes
Out[778]:
x int64
y int64
id int64
date datetime64[ns]
qty int64
dtype: object
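For reference, a self-contained sketch of the same approach, with the frame built inline; the fillna(pd.Timedelta(0)) variant is an assumption on my part to keep the result an integer column across pandas versions:
import pandas as pd

df = pd.DataFrame({
    'x': [6] * 6,
    'y': [3] * 6,
    'id': [932, 932, 932, 933, 933, 933],
    'date': ['2017-05-14', '2017-05-15', '2017-05-18',
             '2016-10-03', '2016-10-09', '2016-10-15'],
    'qty': [212, 212, 212, 518, 16, 28],
})

# parse the strings so diff() yields timedeltas
df['date'] = pd.to_datetime(df['date'])

# per-id difference to the previous order; the first row of each id is NaT
delta = df.groupby('id')['date'].diff()

# fill NaT with a zero timedelta before extracting days, keeping an int dtype
df['date'] = delta.fillna(pd.Timedelta(0)).dt.days

print(df)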

Related

Pandas groupby apply a random day to each group of years

I am trying to generate a different random day within each year group of a dataframe. So I need replace=False, otherwise it will fail.
You can't just add a column of random numbers, because I'm going to have more than 365 years in my list of years, and once you hit 365 it can't create any more random samples without replacement.
I have explored agg, aggregate, apply and transform. The closest I have got is with this:
years = pd.DataFrame({"year": [1,1,2,2,2,3,3,4,4,4,4]})
years["day"] = 0
grouped = years.groupby("year")["day"]
grouped.transform(lambda x: np.random.choice(366, replace=False))
Which gives this:
0 8
1 8
2 319
3 319
4 319
5 149
6 149
7 130
8 130
9 130
10 130
Name: day, dtype: int64
But I want this:
0 8
1 16
2 119
3 321
4 333
5 4
6 99
7 30
8 129
9 224
10 355
Name: day, dtype: int64
You can use your code with a minor modification. You have to specify the number of samples.
random_days = lambda x: np.random.choice(range(1, 366), len(x), replace=False)
years['day'] = years.groupby('year').transform(random_days)
Output:
>>> years
year day
0 1 18
1 1 300
2 2 154
3 2 355
4 2 311
5 3 18
6 3 14
7 4 160
8 4 304
9 4 67
10 4 6
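If you prefer to be explicit about which column is transformed and want reproducible draws, here is a minimal sketch of the same idea (the seed is my assumption, not part of the original answer); note that replace=False requires each year group to have at most 365 rows:
import numpy as np
import pandas as pd

np.random.seed(0)  # assumed: reproducible output

years = pd.DataFrame({"year": [1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4]})
years["day"] = 0

# draw len(group) distinct days from 1..365 for each year group
random_days = lambda x: np.random.choice(range(1, 366), len(x), replace=False)
years["day"] = years.groupby("year")["day"].transform(random_days)

print(years)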
With numpy broadcasting:
years["day"] = np.random.choice(366, years.shape[0], False) % 366
years["day"] = years.groupby("year").transform(lambda x: np.random.permutation(x))
Output:
print(years)
year day
0 1 233
1 1 147
2 2 1
3 2 340
4 2 267
5 3 204
6 3 256
7 4 354
8 4 94
9 4 196
10 4 164

Transform each group in a DataFrame

I have the following DataFrame:
id x y timestamp sensorTime
1 32 30 1031 2002
1 4 105 1035 2005
1 8 110 1050 2006
2 18 10 1500 3600
2 40 20 1550 3610
2 80 10 1450 3620
....
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1,1,1,2,2,2], [32,4,8,18,40,80], [30,105,110,10,20,10], [1031,1035,1050,1500,1550,1450], [2002, 2005, 2006, 3600, 3610, 3620]])).T
df.columns = ['id', 'x', 'y', 'timestamp', 'sensorTime']
For each group grouped by id I would like to add the differences of the sensorTime to the first value of timestamp. Something like the following:
start = df.iloc[0]['timestamp']
df['sensorTime'] -= df.iloc[0]['sensorTime']
df['sensorTime'] += start
But I would like to do this for each id group separately.
The resulting DataFrame should be:
id x y timestamp sensorTime
1 32 30 1031 1031
1 4 105 1035 1034
1 8 110 1050 1035
2 18 10 1500 1500
2 40 20 1550 1510
2 80 10 1450 1520
....
How can this operation be done per group?
df
id x y timestamp sensorTime
0 1 32 30 1031 2002
1 1 4 105 1035 2005
2 1 8 110 1050 2006
3 2 18 10 1500 3600
4 2 40 20 1550 3610
5 2 80 10 1450 3620
You can group by id and then pass both timestamp and sensorTime. Then you can use diff to get the difference of sensorTime. The first value would be NaN and you can replace it with the first value of timestamp of that group. Then you can simply do cumsum to get the desired output.
def func(x):
    diff = x['sensorTime'].diff()
    diff.iloc[0] = x['timestamp'].iloc[0]
    return diff.cumsum().to_frame()

df['sensorTime'] = df.groupby('id')[['timestamp', 'sensorTime']].apply(func)
df
id x y timestamp sensorTime
0 1 32 30 1031 1031.0
1 1 4 105 1035 1034.0
2 1 8 110 1050 1035.0
3 2 18 10 1500 1500.0
4 2 40 20 1550 1510.0
5 2 80 10 1450 1520.0
You could run a groupby twice: first to get the difference in sensorTime, and a second time to do the cumulative sum:
box = df.groupby("id").sensorTime.transform("diff")
df.assign(
    new_sensorTime=np.where(box.isna(), df.timestamp, box),
    new=lambda x: x.groupby("id")["new_sensorTime"].cumsum(),
).drop(columns="new_sensorTime")
id x y timestamp sensorTime new
0 1 32 30 1031 2002 1031.0
1 1 4 105 1035 2005 1034.0
2 1 8 110 1050 2006 1035.0
3 2 18 10 1500 3600 1500.0
4 2 40 20 1550 3610 1510.0
5 2 80 10 1450 3620 1520.0
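A third formulation, not taken from the answers above but a minimal sketch of the same arithmetic, uses two transform('first') calls to shift each group's sensorTime so it starts at that group's first timestamp (it assumes the numeric df built in the question):
# per-id first values, broadcast back to every row of that id
first_sensor = df.groupby('id')['sensorTime'].transform('first')
first_ts = df.groupby('id')['timestamp'].transform('first')

# shift each group's sensorTime so it starts at the group's first timestamp
df['sensorTime'] = df['sensorTime'] - first_sensor + first_ts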

Generating rows in a pandas dataframe to make up for missing values of a column (or multiple columns)

I have the following dataframe.
hour sensor_id hourly_count
0 1 101 651
1 1 102 19
2 2 101 423
3 2 102 12
4 3 101 356
5 4 101 79
6 4 102 21
7 5 101 129
8 6 101 561
Notice that for sensor_id 102, there are no values for hour = 3. This is due to the fact that the sensors do not generate a separate row of data if the hourly_count is equal to zero. This means that sensor 102 should have hourly_counts = 0 at hour = 3, but this is just the way the original data was collected.
I would ideally like code that fills in this gap: if there are 2 sensors, each sensor should have a record for every hour, and if one of them is missing for an hour, a row should be inserted into the dataframe for that sensor and hour with hourly_count set to 0.
hour sensor_id hourly_count
0 1 101 651
1 1 102 19
2 2 101 423
3 2 102 12
4 3 101 356
5 3 102 0
6 4 101 79
7 4 102 21
8 5 101 129
9 5 102 0
10 6 101 561
11 6 102 0
Any help is really appreciated.
Using DataFrame.reindex, you can explicitly define your index. This is useful if you are missing data from both sensors for a particular hour. You can also extend the hour beyond what you have. In the following example, it extends out to hour 8.
new_ix = pd.MultiIndex.from_product([range(1,9), [101, 102]], names=['hour', 'sensor_id'])
df_new = df.set_index(['hour', 'sensor_id'])
df_new.reindex(new_ix, fill_value=0).reset_index()
Output:
hour sensor_id hourly_count
0 1 101 651
1 1 102 19
2 2 101 423
3 2 102 12
4 3 101 356
5 3 102 0
6 4 101 79
7 4 102 21
8 5 101 129
9 5 102 0
10 6 101 561
11 6 102 0
12 7 101 0
13 7 102 0
14 8 101 0
15 8 102 0
Use pandas.DataFrame.pivot and then unstack with reset_index:
new_df = df.pivot('sensor_id','hour', 'hourly_count').fillna(0).unstack().reset_index()
print(new_df)
Output:
hour sensor_id 0
0 1 101 651.0
1 1 102 19.0
2 2 101 423.0
3 2 102 12.0
4 3 101 356.0
5 3 102 0.0
6 4 101 79.0
7 4 102 21.0
8 5 101 129.0
9 5 102 0.0
10 6 101 561.0
11 6 102 0.0
Assume the missing rows occur only for sensor_id 102. One way is to build a new df with all combinations of hour and sensor_id, left-merge it with the original df to get hourly_count, and fillna:
a = df.hour.unique()
df1 = pd.MultiIndex.from_product([a, [101, 102]]).to_frame(index=False, name=['hour', 'sensor_id'])
Out[157]:
hour sensor_id
0 1 101
1 1 102
2 2 101
3 2 102
4 3 101
5 3 102
6 4 101
7 4 102
8 5 101
9 5 102
10 6 101
11 6 102
df1.merge(df, on=['hour','sensor_id'], how='left').fillna(0)
Out[161]:
hour sensor_id hourly_count
0 1 101 651.0
1 1 102 19.0
2 2 101 423.0
3 2 102 12.0
4 3 101 356.0
5 3 102 0.0
6 4 101 79.0
7 4 102 21.0
8 5 101 129.0
9 5 102 0.0
10 6 101 561.0
11 6 102 0.0
Other way: using unstack with fill_value
df.set_index(['hour', 'sensor_id']).unstack(fill_value=0).stack().reset_index()
Out[171]:
hour sensor_id hourly_count
0 1 101 651
1 1 102 19
2 2 101 423
3 2 102 12
4 3 101 356
5 3 102 0
6 4 101 79
7 4 102 21
8 5 101 129
9 5 102 0
10 6 101 561
11 6 102 0
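One detail worth noting: the pivot and merge variants above leave the counts as float, because NaNs are introduced before fillna, while the expected output uses integers. A minimal sketch continuing the merge approach (out is just a hypothetical name for the merged result):
out = df1.merge(df, on=['hour', 'sensor_id'], how='left')
out['hourly_count'] = out['hourly_count'].fillna(0).astype(int)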

Error while dropping row from dataframe based on value comparison

I have following unique values in dataframe column.
['1473' '1093' '1346' '1324' 'NA' '1129' '58' '847' '54' '831' '816']
I want to drop rows which have 'NA' in this column.
testData = testData[testData.BsmtUnfSF != "NA"]
and got error
TypeError: invalid type comparison
Then I tried
testData = testData[testData.BsmtUnfSF != np.NAN]
It doesn't give any error but it doesn't drop rows.
How to solve this issue?
Here is how you can do it. Just replace column with the column name you want.
import pandas as pd
import numpy as np
df = pd.DataFrame({"column": [1,2,3,np.nan,6]})
df = df[np.isfinite(df['column'])]
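One caveat: np.isfinite only works on a numeric column. In the question the column holds strings, including the literal 'NA', so a conversion step is needed first; a minimal sketch under that assumption:
import pandas as pd

# the column holds strings, as in the question
testData = pd.DataFrame({"BsmtUnfSF": ['1473', '1093', 'NA', '58']})

# coerce the strings to numbers; the literal 'NA' becomes NaN
testData['BsmtUnfSF'] = pd.to_numeric(testData['BsmtUnfSF'], errors='coerce')

# now the NaN rows can be dropped (or filtered with notnull / np.isfinite)
testData = testData.dropna(subset=['BsmtUnfSF'])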
You could use dropna:
testData = testData.dropna(subset=['BsmtUnfSF'])
Assuming your DataFrame is:
>>> df
col1
0 1473
1 1093
2 1346
3 1324
4 NaN
5 1129
6 58
7 847
8 54
9 831
10 816
You have multiple solutions:
>>> df[pd.notnull(df['col1'])]
col1
0 1473
1 1093
2 1346
3 1324
5 1129
6 58
7 847
8 54
9 831
10 816
>>> df[df.col1.notnull()]
# df[df['col1'].notnull()]
col1
0 1473
1 1093
2 1346
3 1324
5 1129
6 58
7 847
8 54
9 831
10 816
>>> df.dropna(subset=['col1'])
col1
0 1473
1 1093
2 1346
3 1324
5 1129
6 58
7 847
8 54
9 831
10 816
>>> df.dropna()
col1
0 1473
1 1093
2 1346
3 1324
5 1129
6 58
7 847
8 54
9 831
10 816
>>> df[~df.col1.isnull()]
col1
0 1473
1 1093
2 1346
3 1324
5 1129
6 58
7 847
8 54
9 831
10 816

group by within group by in pandas

Consider the following dataset:
min 5-min a
0 0 800
0 0 801
1 0 802
1 0 803
1 0 804
2 0 805
2 0 805
2 0 810
3 0 801
3 0 802
3 0 803
4 0 804
4 0 805
5 1 806
5 1 800
5 1 890
6 1 890
6 1 880
6 1 800
7 1 804
7 1 806
8 1 801
9 1 800
9 1 900
10 1 770
10 1 803
10 1 811
I need to calculate the std of a within each group based on the minute, and then calculate the mean of the resulting values within each group of 5 min.
I do not know how to find the border of each 5-min group after calculating the std.
How should I save the data so I know which std belongs to which 5-min group?
data.groupby('minute').a.std()
I would appreciate any help.
Taskos' answer is great, but I wasn't sure if you needed the data to be pushed back into the dataframe or not. Assuming what you want is to add the new columns to the parent frame after each groupby operation, I've opted to do that for you as follows:
import pandas as pd
df = your_df
# First we create the standard deviation column
def add_std(grp):
    grp['stdevs'] = grp['a'].std()
    return grp

df = df.groupby('min').apply(add_std)

# Next we create the 5 minute mean column
def add_meandev(grp):
    grp['meandev'] = grp['stdevs'].mean()
    return grp

print(df.groupby('5-min').apply(add_meandev))
This can be done more elegantly by chaining (a compact version is sketched after the table below), but I have opted to lay it out like this so that the underlying process is more visible to you.
The final output from this will look like the following:
min 5-min a stdevs meandev
0 0 0 800 0.707107 1.345283
1 0 0 801 0.707107 1.345283
2 1 0 802 1.000000 1.345283
3 1 0 803 1.000000 1.345283
4 1 0 804 1.000000 1.345283
5 2 0 805 2.886751 1.345283
6 2 0 805 2.886751 1.345283
7 2 0 810 2.886751 1.345283
8 3 0 801 1.000000 1.345283
9 3 0 802 1.000000 1.345283
10 3 0 803 1.000000 1.345283
11 4 0 804 0.707107 1.345283
12 4 0 805 0.707107 1.345283
13 5 1 806 50.318983 39.107147
14 5 1 800 50.318983 39.107147
15 5 1 890 50.318983 39.107147
16 6 1 890 49.328829 39.107147
17 6 1 880 49.328829 39.107147
18 6 1 800 49.328829 39.107147
19 7 1 804 1.414214 39.107147
20 7 1 806 1.414214 39.107147
21 8 1 801 NaN 39.107147
22 9 1 800 70.710678 39.107147
23 9 1 900 70.710678 39.107147
24 10 1 770 21.733231 39.107147
25 10 1 803 21.733231 39.107147
26 10 1 811 21.733231 39.107147
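The chaining mentioned above could look like the following minimal sketch (my own condensation, assuming the 'min', '5-min' and 'a' column names from the sample data); it produces the same stdevs and meandev columns as the table:
# per-minute std of a, broadcast to every row of that minute
df['stdevs'] = df.groupby('min')['a'].transform('std')

# row-weighted mean of those stds within each 5-min block, broadcast back
df['meandev'] = df.groupby('5-min')['stdevs'].transform('mean')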
Not 100% clear on what you are asking... but I think this is what you need:
data.groupby(['min','5-min']).std().groupby('5-min').mean()
This computes the standard deviation within each 'min' group and then takes the mean of those standard deviations within each '5-min' group.
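If you also need to keep track of which value belongs to which 5-min group (the "how should I save the data" part of the question), the intermediate results can be kept as named Series and mapped back onto the rows; a minimal sketch, assuming the sample column names (this averages each minute's std once, unweighted by row count, so the numbers differ slightly from the row-weighted table above):
per_min_std = data.groupby(['min', '5-min'])['a'].std()   # std of a per minute
per_5min_mean = per_min_std.groupby('5-min').mean()       # mean of those stds per 5-min block
data['meandev'] = data['5-min'].map(per_5min_mean)        # attach the 5-min mean to every row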
