Expanding just last row of groupby - python

I need to get the expanding mean grouped by name.
I already have this code:
import pandas as pd

data = {
    'id': [1, 2, 3, 4, 5, 6, 7, 8],
    'name': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'number': [1, 3, 5, 7, 9, 11, 13, 15]
}
df = pd.DataFrame(data)
df['mean_number'] = df.groupby('name')['number'].apply(
    lambda s: s.expanding().mean().shift()
)
PS: I use .shift() so the mean does not include the current row.
This results in:
   id name  number  mean_number
0   1    A       1          NaN
1   2    B       3          NaN
2   3    A       5          1.0
3   4    B       7          3.0
4   5    A       9          3.0
5   6    B      11          5.0
6   7    A      13          5.0
7   8    B      15          7.0
This works, but I only need the last result of each group:
   id name  number  mean_number
6   7    A      13          5.0
7   8    B      15          7.0
I would like to know if it is possible to get the mean for only these last rows, because on a very large dataset it takes a long time to compute the values for every row and then filter only the last ones.

If you only need the last mean per group, you can take the sum and count per group and calculate the values like this:
groups = df.groupby('name').agg(s=("number", "sum"), c=("number", "count"))
groups
       s  c
name
A     28  4
B     36  4
Then you can use .tail(1) to get the last row of each group:
tail = df.groupby('name').tail(1).set_index("name")
tail
      id  number
name
A      7      13
B      8      15
Calculate the mean like this:
(groups.s - tail.number) / (groups.c - 1)
name
A    5.0
B    7.0
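Put together as one runnable sketch (same sample data as above), this avoids materializing the expanding mean for every row:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8],
    'name': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'number': [1, 3, 5, 7, 9, 11, 13, 15],
})

# One aggregation pass per group: total and count of 'number'
groups = df.groupby('name')['number'].agg(s='sum', c='count')

# Last row of each group, indexed by name so it aligns with `groups`
tail = df.groupby('name').tail(1).set_index('name')

# Mean of everything before the last row: (total - last value) / (count - 1)
mean_number = (groups.s - tail.number) / (groups.c - 1)
print(mean_number)
```

This does two cheap group aggregations instead of an expanding window over every row.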

Apply different mathematical function in table in Python

I have two columns, Column A and Column B, with some values like below.
Now I want to apply a normal arithmetic function to each row and add the result in the next column, but a different arithmetic operator should be applied to each row, like:
A+B for the first row
A-B for the second row
A*B for the third row
A/B for the fourth row
and so on until the nth row, repeating the same cycle of mathematical functions.
Can someone please help me with this code in Python?
python-3.x
pandas
We can use:
row.name to access the index when using apply on a row
a dictionary to map indexes to operations
Code
import operator as _operator
import pandas as pd

# Data
d = {"A": [5, 6, 7, 8, 9, 10, 11],
     "B": [1, 2, 3, 4, 5, 6, 7]}
df = pd.DataFrame(d)
print(df)

# Mapping from index to mathematical operation
operator_map = {
    0: _operator.add,
    1: _operator.sub,
    2: _operator.mul,
    3: _operator.truediv,
}

# Use row.name % 4 so the operators repeat in a cycle of 4
df['new'] = df.apply(lambda row: operator_map[row.name % 4](*row), axis=1)
Output
Initial df
    A  B
0   5  1
1   6  2
2   7  3
3   8  4
4   9  5
5  10  6
6  11  7
New df
    A  B   new
0   5  1   6.0
1   6  2   4.0
2   7  3  21.0
3   8  4   2.0
4   9  5  14.0
5  10  6   4.0
6  11  7  77.0
IIUC, you can try DataFrame.apply on rows with the operator module (this version indexes operators by row position, so it assumes at most four rows):
import operator
operators = [operator.add, operator.sub, operator.mul, operator.truediv]
df['C'] = df.apply(lambda row: operators[row.name](*row), axis=1)
print(df)
   A  B     C
0  5  1   6.0
1  6  2   4.0
2  7  3  21.0
3  8  4   2.0
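If apply is too slow on a large frame, the same four-operator cycle can also be vectorized. A sketch using np.select, with the A/B columns from the example above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [5, 6, 7, 8, 9, 10, 11],
                   "B": [1, 2, 3, 4, 5, 6, 7]})

# Position of each row in the 4-step cycle: 0=add, 1=sub, 2=mul, 3=div
cycle = df.index % 4
df["new"] = np.select(
    [cycle == 0, cycle == 1, cycle == 2, cycle == 3],
    [df["A"] + df["B"], df["A"] - df["B"], df["A"] * df["B"], df["A"] / df["B"]],
)
print(df)
```

All four candidate columns are computed with fast column arithmetic and np.select picks one value per row.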

How do I create features in featuretools for rows with the same id and a time index?

I have a DataFrame like this:
data = {'Customer': ['C1', 'C1', 'C1', 'C2', 'C2', 'C2', 'C3', 'C3', 'C3'],
        'NumOfItems': [3, 2, 4, 5, 5, 6, 10, 6, 14],
        'PurchaseTime': ["2014-01-01", "2014-01-02", "2014-01-03",
                         "2014-01-01", "2014-01-02", "2014-01-03",
                         "2014-01-01", "2014-01-02", "2014-01-03"]}
df = pd.DataFrame(data)
df
I want to create a Feature which is for example the max value for each customer up to this point:
'MaxPerID(NumOfItems)': [3, 3, 4, 5, 5, 6, 10, 10, 14]  # the output I want
So I set up the EntitySet and normalize it …
es = ft.EntitySet(id="customer_data")
es = es.entity_from_dataframe(entity_id="customer",
                              dataframe=df,
                              index='index',
                              time_index="PurchaseTime",
                              make_index=True)
es = es.normalize_entity(base_entity_id="customer",
                         new_entity_id="sessions",
                         index="Customer")
But creating the feature matrix doesn't produce the results I want.
feature_matrix, features = ft.dfs(entityset=es,
                                  target_entity="customer",
                                  agg_primitives=["max"],
                                  max_depth=3)
feature_matrix.head()
       sessions.MAX(customer.NumOfItems)
index
0                                      4
3                                      6
6                                     14
1                                      4
4                                      6
7                                     14
2                                      4
5                                      6
8                                     14
The returned feature is the max value per day across all customers (sorted by time). However, if I run the same code without time_index="PurchaseTime", the result is the max value just for the specific customer:
       sessions.MAX(customer.NumOfItems)
index
0                                      4
1                                      4
2                                      4
3                                      6
4                                      6
5                                      6
6                                     14
7                                     14
8                                     14
I want a combination of these two: the max value for the specific customer up to this point.
Is this possible? I tried working with es['customer']['Customer'].interesting_values = ['C1', 'C2', 'C3'] but it didn't get me anywhere. I also tried modifying the new normalized entity and writing my own primitive for this.
I'm new to featuretools, so any help would be greatly appreciated.
This question is similar to mine, but its solution has no time_index and creates the new features on the normalized entity.
Thanks for the question. You can get the expected output by using a group by transform primitive.
fm, fd = ft.dfs(
    entityset=es,
    target_entity="customer",
    groupby_trans_primitives=['cum_max'],
)
You should get the cumulative max of the number of items per customer.
column = 'CUM_MAX(NumOfItems) by Customer'
actual = fm[[column]].sort_values(column)
expected = {'MaxPerID(NumOfItems)': [3, 3, 4, 5, 5, 6, 10, 10, 14]}
actual.assign(**expected)
       CUM_MAX(NumOfItems) by Customer  MaxPerID(NumOfItems)
index
0                                  3.0                     3
1                                  3.0                     3
2                                  4.0                     4
3                                  5.0                     5
4                                  5.0                     5
5                                  6.0                     6
6                                 10.0                    10
7                                 10.0                    10
8                                 14.0                    14
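As a cross-check outside featuretools, the desired feature is just a per-customer running max, which plain pandas computes with cummax. A sketch on the sample data (PurchaseTime omitted since the rows are already in time order):

```python
import pandas as pd

df = pd.DataFrame({
    'Customer': ['C1', 'C1', 'C1', 'C2', 'C2', 'C2', 'C3', 'C3', 'C3'],
    'NumOfItems': [3, 2, 4, 5, 5, 6, 10, 6, 14],
})

# Running maximum of NumOfItems within each customer, in row (time) order
df['MaxPerCustomer'] = df.groupby('Customer')['NumOfItems'].cummax()
print(df)
```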

Pandas: set all values that are <= 0 to the maximum value in a column by group, but only after the last positive value in that group

I am trying to set all values that are <= 0, by group, to the maximum value in that group, but only after the last positive value. That is, all values <= 0 that come before the group's last positive value must be left unchanged. Example:
data = {'group': ['A', 'A', 'A', 'A', 'A', 'B', 'B',
                  'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C'],
        'value': [3, 0, 8, 7, 0, -1, 0, 9, -2, 0, 0, 2, 0, 5, 0, 1]}
df = pd.DataFrame(data)
df
   group  value
0      A      3
1      A      0
2      A      8
3      A      7
4      A      0
5      B     -1
6      B      0
7      B      9
8      B     -2
9      B      0
10     B      0
11     C      2
12     C      0
13     C      5
14     C      0
15     C      1
and the result must be:
   group  value
0      A      3
1      A      0
2      A      8
3      A      7
4      A      8
5      B     -1
6      B      0
7      B      9
8      B      9
9      B      9
10     B      9
11     C      2
12     C      0
13     C      5
14     C      0
15     C      1
Thanks for any advice.
Start by adding a column to identify the rows with negative value (more precisely <= 0):
df['neg'] = (df['value'] <= 0)
Then, for each group, find the sequence of last few entries that have 'neg' set to True and that are contiguous. In order to do that, reverse the order of the DataFrame (with .iloc[::-1]) and then use .cumprod() on the 'neg' column. cumprod() will treat True as 1 and False as 0, so the cumulative product will be 1 as long as you're seeing all True's and will become and stay 0 as soon as you see the first False. Since we reversed the order, we're going backwards from the end, so we're finding the sequence of True's at the end.
df['upd'] = df.iloc[::-1].groupby('group')['neg'].cumprod().astype(bool)
Now that we know which entries to update, we just need to know what to update them to, which is the max of the group. We can use transform('max') on a groupby to get that value and then all that's left is to do the actual update of 'value' where 'upd' is set:
df.loc[df['upd'], 'value'] = df.groupby('group')['value'].transform('max')
We can finish by dropping the two auxiliary columns we used in the process:
df = df.drop(['neg', 'upd'], axis=1)
The result I got matches your expected result.
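The steps above, collected into one runnable sketch; the boolean flag is cast to int before cumprod, which some pandas versions require for groupby cumulative ops:

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B',
              'C', 'C', 'C', 'C', 'C'],
    'value': [3, 0, 8, 7, 0, -1, 0, 9, -2, 0, 0, 2, 0, 5, 0, 1],
})

# Flag rows with value <= 0 (as int so cumprod works across pandas versions)
df['neg'] = (df['value'] <= 0).astype(int)

# Walk each group backwards: the cumulative product stays 1 only over the
# trailing run of flagged rows, i.e. the values after the last positive one
df['upd'] = df.iloc[::-1].groupby('group')['neg'].cumprod().astype(bool)

# Overwrite just those rows with the group's max, then drop the helpers
df.loc[df['upd'], 'value'] = df.groupby('group')['value'].transform('max')
df = df.drop(columns=['neg', 'upd'])
print(df)
```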
UPDATE: Or do the whole operation in a single (long!) line, without adding any auxiliary columns to the original DataFrame:
df.loc[
    df.assign(
        neg=(df['value'] <= 0)
    ).iloc[::-1].groupby(
        'group'
    )['neg'].cumprod().astype(bool),
    'value'
] = df.groupby(
    'group'
)['value'].transform('max')
You can do it this way.
(df.loc[(df.assign(m=df['value'].lt(0)).groupby(['group'], sort=False)['m'].transform('any')) &
        (df.index >= df.groupby('group')['value'].transform('idxmin')), 'value']) = np.nan
df['value'] = df.groupby('group').ffill()
df
Output
   group  value
0      A    3.0
1      A    0.0
2      A    8.0
3      A    7.0
4      A    0.0
5      B   -1.0
6      B    0.0
7      B    9.0
8      B    9.0
9      B    9.0
10     B    9.0
11     C    2.0
12     C    0.0
13     C    5.0
14     C    0.0
15     C    1.0

Groupby, apply function to each row with shift, and create new column

I want to group by id, apply a function to the data, and create a new column with the results. It seems there must be a faster/more efficient way to do this than to pass the data to the function, make the changes, and return the data. Here is an example.
Example
dat = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b', 'b'], 'x': [4, 8, 12, 25, 30, 50]})

def my_func(data):
    data['diff'] = (data['x'] - data['x'].shift(1, fill_value=data['x'].iat[0]))
    return data

dat.groupby('id').apply(my_func)
Output
> print(dat)
  id   x  diff
0  a   4     0
1  a   8     4
2  a  12     4
3  b  25     0
4  b  30     5
5  b  50    20
Is there a more efficient way to do this?
You can use groupby().diff() for this and then fill the NaN with zero, like the following:
dat['diff'] = dat.groupby('id').x.diff().fillna(0)
print(dat)
  id   x  diff
0  a   4   0.0
1  a   8   4.0
2  a  12   4.0
3  b  25   0.0
4  b  30   5.0
5  b  50  20.0
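The same idea as a self-contained sketch (sample data as above); the astype(int) at the end is an optional extra step to restore integer dtype after fillna:

```python
import pandas as pd

dat = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b', 'b'],
                    'x': [4, 8, 12, 25, 30, 50]})

# diff() within each group; the first row of a group has no predecessor (NaN),
# so fill it with 0, then cast back to int
dat['diff'] = dat.groupby('id')['x'].diff().fillna(0).astype(int)
print(dat)
```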

Python Pandas add column with relative order numbers

How do I add a order number column to an existing DataFrame?
This is my DataFrame:
import pandas as pd
import math

frame = pd.DataFrame([[1, 4, 2], [8, 9, 2], [10, 2, 1]], columns=['a', 'b', 'c'])

def add_stats(row):
    row['sum'] = sum([row['a'], row['b'], row['c']])
    row['sum_sq'] = sum(math.pow(v, 2) for v in [row['a'], row['b'], row['c']])
    row['max'] = max(row['a'], row['b'], row['c'])
    return row

frame = frame.apply(add_stats, axis=1)
print(frame.head())
The resulting data is:
    a  b  c  sum  sum_sq  max
0   1  4  2    7      21    4
1   8  9  2   19     149    9
2  10  2  1   13     105   10
First, I would like to add 3 extra columns with order numbers, sorting on sum, sum_sq, and max, respectively. Next, these 3 columns should be combined into one column, the mean of the order numbers, but I do know how to do that part (with apply and axis=1).
I think you're looking for rank where you mention sorting. Given your example, add:
frame['sum_order'] = frame['sum'].rank()
frame['sum_sq_order'] = frame['sum_sq'].rank()
frame['max_order'] = frame['max'].rank()
frame['mean_order'] = frame[['sum_order', 'sum_sq_order', 'max_order']].mean(axis=1)
To get:
    a  b  c  sum  sum_sq  max  sum_order  sum_sq_order  max_order  mean_order
0   1  4  2    7      21    4          1             1          1    1.000000
1   8  9  2   19     149    9          3             3          2    2.666667
2  10  2  1   13     105   10          2             2          3    2.333333
The rank method has some options as well, to specify the behavior in case of identical or NA-values for example.
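If the row-wise apply used to build sum/sum_sq/max is itself a bottleneck, those columns can be computed vectorized as well; a sketch that then ranks and averages exactly as above:

```python
import pandas as pd

frame = pd.DataFrame([[1, 4, 2], [8, 9, 2], [10, 2, 1]], columns=['a', 'b', 'c'])

# Vectorized versions of the three row statistics
stats = frame.assign(sum=frame.sum(axis=1),
                     sum_sq=(frame ** 2).sum(axis=1),
                     max=frame.max(axis=1))

# Rank each statistic, then average the three order numbers per row
order = stats[['sum', 'sum_sq', 'max']].rank()
stats['mean_order'] = order.mean(axis=1)
print(stats)
```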
