Applying function to groups in dataframe, excluding current row value - python

Let's say I have the following data:
day  query  num_searches
1    abc    2
1    def    3
2    abc    6
3    abc    5
4    def    1
4    abc    3
5    abc    7
6    abc    8
7    abc    10
8    abc    1
I'd like to generate z-score (excluding the current row's value) such that for query 'abc':
Day 1: [6, 5, 3, 7, 8, 10, 1] (exclude the value 2) zscore = -1.32
Day 2: [2, 5, 3, 7, 8, 10, 1] (exclude the value 6) zscore = 0.28
...
Day 7: [2, 6, 5, 3, 7, 8, 1] (exclude the value 10) zscore = 2.22
Day 8: [2, 6, 5, 3, 7, 8, 10] (exclude the value 1) zscore = -1.88
I have the following function to calculate this 'exclusive' zscore.
def zscore_exclusive(arr):
    newl = []
    for index, val in enumerate(arr):
        l = list(arr)
        val = l.pop(index)
        arr_popped = np.array(l)
        avg = np.mean(arr_popped)
        stdev = np.std(arr_popped)
        newl.append((val - avg) / stdev)
    return np.array(newl)
How can I apply this custom function to each grouping (by query string)? Remember, I'd like to pop the currently evaluated element from the series.

Given:
day query num_searches
0 1 abc 2
1 1 def 3
2 2 abc 6
3 3 abc 5
4 4 def 1
5 4 abc 3
6 5 abc 7
7 6 abc 8
8 7 abc 10
9 8 abc 1
Doing:
Note!
For np.std, ddof = 0 by default, but for pd.Series.std, ddof = 1 by default, so be sure which one you want to use. Also note that for the 'def' group only one value remains after dropping the current row, so its std is 0 and the z-score comes out as ±inf.
z_score = lambda x: [(x[i]-x.drop(i).mean())/x.drop(i).std(ddof=0) for i in x.index]
df['z-score'] = df.groupby('query')['num_searches'].transform(z_score)
print(df)
Output:
day query num_searches z-score
0 1 abc 2 -1.319950
1 1 def 3 inf
2 2 abc 6 0.277350
3 3 abc 5 -0.092057
4 4 def 1 -inf
5 4 abc 3 -0.866025
6 5 abc 7 0.661438
7 6 abc 8 1.083862
8 7 abc 10 2.223782
9 8 abc 1 -1.877336
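For completeness, the asker's own leave-one-out function (with its arr/x name mismatch fixed) can be passed straight to transform as well; a sketch reproducing the table above:

```python
import numpy as np
import pandas as pd

def zscore_exclusive(arr):
    # Leave-one-out z-score: each element is scored against the
    # mean/std of the remaining elements (population std, ddof=0).
    arr = np.asarray(arr, dtype=float)
    out = np.empty_like(arr)
    for i in range(len(arr)):
        rest = np.delete(arr, i)
        out[i] = (arr[i] - rest.mean()) / rest.std()
    return out

df = pd.DataFrame({
    'day':          [1, 1, 2, 3, 4, 4, 5, 6, 7, 8],
    'query':        ['abc', 'def', 'abc', 'abc', 'def',
                     'abc', 'abc', 'abc', 'abc', 'abc'],
    'num_searches': [2, 3, 6, 5, 1, 3, 7, 8, 10, 1],
})
df['z-score'] = df.groupby('query')['num_searches'].transform(zscore_exclusive)
```

As with the lambda version, the two-row 'def' group divides by a zero std and yields ±inf (numpy emits a runtime warning rather than raising).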

Related

How to group pandas df by multiple conditions, take the mean and append to df?

I have a df looking something like this:
df = pd.DataFrame({
'Time' : [1,2,7,10,15,16,77,98,999,1000,1121,1245,1373,1490,1555],
'Act_cat' : [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4],
'Count' : [6, 2, 4, 1, 2, 1, 8, 4, 3, 1, 4, 13, 3, 1, 2],
'Moving': [1,0,1,0,1,0,1,0,1,0,1,0,1,0,1]})
I would like to group by same values in "Act_cat" following each other and by "Moving" ==1 and for these groups take the mean of the "count" column and map it back onto the df.
I have tried the code below, but it averages all rows of the "Count" column, not only the ones where "Moving" == 1.
group1 = (df['Moving'].eq(1) & df['Act_cat'].diff().abs() > 0).cumsum()
mean_values = df.groupby(group1)["Count"].mean()
df['newcol'] = group1.map(mean_values)
Please let me know how I could solve this!
Thank you,
Tahnee
IIUC use:
group1 = (df['Moving'].eq(1) & df['Act_cat'].diff().abs() > 0).cumsum()
mean_values = df[df['Moving'].eq(1)].groupby(group1)["Count"].mean()
df['newcol'] = group1.map(mean_values)
Alternative solution:
group1 = (df['Moving'].eq(1) & df['Act_cat'].diff().abs() > 0).cumsum()
df['newcol'] = df['Count'].where(df['Moving'].eq(1)).groupby(group1).transform('mean')
print (df)
Time Act_cat Count Moving newcol
0 1 1 6 1 4.6
1 2 1 2 0 4.6
2 7 1 4 1 4.6
3 10 1 1 0 4.6
4 15 1 2 1 4.6
5 16 2 1 0 4.6
6 77 2 8 1 4.6
7 98 2 4 0 4.6
8 999 2 3 1 4.6
9 1000 2 1 0 4.6
10 1121 4 4 1 3.0
11 1245 4 13 0 3.0
12 1373 4 3 1 3.0
13 1490 4 1 0 3.0
14 1555 4 2 1 3.0
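The where + transform trick can be isolated on a toy frame: masked rows drop out of the mean but still receive the group result. A minimal sketch (made-up data, with run labels built via ne/cumsum):

```python
import pandas as pd

df = pd.DataFrame({'Act_cat': [1, 1, 1, 2, 2, 4, 4],
                   'Count':   [6, 2, 4, 1, 8, 4, 13],
                   'Moving':  [1, 0, 1, 0, 1, 1, 0]})
# Label runs of consecutive equal Act_cat values: 1,1,1,2,2,3,3
runs = df['Act_cat'].ne(df['Act_cat'].shift()).cumsum()
# NaN-out Count where Moving != 1: those rows are ignored by mean(),
# but transform still broadcasts the run mean back to every row.
df['newcol'] = (df['Count'].where(df['Moving'].eq(1))
                           .groupby(runs).transform('mean'))
```

Here run 1 averages only the Moving rows (6 and 4, giving 5.0), yet all three rows of the run get that value.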

Create row values in a column with data from previous row in Python Pandas

I am trying to create a new column in which the value in the first row is 0, and from the second row onward each value is calculated as:
ColumnA[This row] = (ColumnA[Last row] * 13 + ColumnB[This row]) / 14
I am using the pandas shift function, but it doesn't seem to produce the intended result.
test = np.array([ 1, 5, 3, 20, 2, 6, 9, 8, 7])
test = pd.DataFrame(test, columns = ['ABC'])
test.loc[test['ABC'] == 1, 'a'] = 0
test['a'] = (test['a'].shift()*13 + test['ABC'])/14
I am trying to create a column that looks like this:
ABC   a
1     0
5     0.3571
3     0.5459
20    1.9355
2     1.9401
6     2.2301
9     2.7137
8     3.0913
7     3.3705
But what I actually get by running the above code is this:
ABC   a
1     nan
2     0
3     nan
4     nan
5     nan
6     nan
7     nan
8     nan
9     nan
test = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
test = pd.DataFrame(test, columns=['ABC'])
test["res"] = test["ABC"]
test.loc[0, 'res'] = 0  # Initialize the first row as 0 (loc avoids chained assignment)
test["res"] = test.res + test.res.shift()
test["res"] = test.res.fillna(0).astype(int)  # shift() introduces a NaN; replace it with 0 and convert back to int
Try:
test["a"] = (test["ABC"].shift().cumsum() + test["ABC"].shift()).fillna(0)
print(test)
Prints:
ABC a
0 1 0.0
1 2 2.0
2 3 5.0
3 4 9.0
4 5 14.0
5 6 20.0
6 7 27.0
7 8 35.0
8 9 44.0
Let's try a for loop
import pandas as pd
df = pd.DataFrame({'ABC': [1, 5, 3, 20, 2, 6, 9, 8, 7]})
lst = [0]
res = 0
for i, row in df.iloc[1:].iterrows():
    res = ((res * 13) + row['ABC']) / 14
    lst.append(res)
df['a'] = pd.Series(lst)
print(df)
Output:
ABC a
0 1 0.000000
1 5 0.357143
2 3 0.545918
3 20 1.935496
4 2 1.940103
5 6 2.230096
6 9 2.713660
7 8 3.091256
8 7 3.370452
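As an aside, the recurrence a[i] = (13 * a[i-1] + ABC[i]) / 14 is exactly an exponentially weighted mean with alpha = 1/14, so the loop can be replaced by ewm once the first value is seeded with 0 (a sketch, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({'ABC': [1, 5, 3, 20, 2, 6, 9, 8, 7]})
# Seed the recursion with 0 in place of the first ABC value, then let
# ewm(adjust=False) apply a[i] = (1 - 1/14) * a[i-1] + (1/14) * ABC[i].
seeded = df['ABC'].astype(float)   # astype returns a copy, safe to mutate
seeded.iloc[0] = 0
df['a'] = seeded.ewm(alpha=1/14, adjust=False).mean()
```

With adjust=False the recursive form is used directly, so the result matches the for-loop output above.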

How to get a new column with sum of even/odd/kth rows?

I have this pandas series:
ts = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])
What I would like to get is a dataframe which contains another column with the sum of rows 0, 2, 4, 6 and for 1, 3, 5 and 7 (that means, one row is left out when creating the sum).
In this case, this means a new dataframe should look like this one:
index ts sum
0 1 16
1 2 20
2 3 16
3 4 20
4 5 16
5 6 20
6 7 16
7 8 20
How could I do this?
Use the index modulo k to label every kth row:
k = 2
df = ts.to_frame('ts')
df['sum'] = df.groupby(ts.index % k).transform('sum')
#if not default RangeIndex
#df['sum'] = df.groupby(np.arange(len(ts)) % k).transform('sum')
print (df)
ts sum
0 1 16
1 2 20
2 3 16
3 4 20
4 5 16
5 6 20
6 7 16
7 8 20
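If the length of ts happens to be an exact multiple of k (an extra assumption the groupby version does not need), the same sums can be sketched with a numpy reshape:

```python
import numpy as np
import pandas as pd

ts = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])
k = 2
# Fold the series into rows of length k; column j then holds every kth
# element starting at j, and summing down the columns gives the k totals.
sums = ts.to_numpy().reshape(-1, k).sum(axis=0)
df = ts.to_frame('ts')
df['sum'] = np.tile(sums, len(ts) // k)   # repeat the k totals across rows
```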

column index out of range with a data frame

I am currently training a neural network. I would like to split my training and validation data with an 80:20 ratio, keeping each purchase complete (a complete purchase means all rows with the same purchaseid), so I always split after a purchase.
Unfortunately, I get an IndexError: column index (12) out of range. The error occurs at the line mat[purchaseid, itemid] = 1.0. How can I fix this?
Dataframe:
d = {'purchaseid': [0, 0, 0, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 8, 9, 9, 9, 9],
'itemid': [ 3, 8, 2, 10, 3, 10, 4, 12, 3, 12, 3, 4, 8, 6, 3, 0, 5, 12, 9, 9, 13, 1, 7, 11, 11]}
df = pd.DataFrame(data=d)
print(df.head(20))
Methods:
PERCENTAGE_SPLIT = 20

def splitter(df):
    df_ = pd.DataFrame()
    sum_purchase = df['purchaseid'].nunique()
    amount = round((sum_purchase / 100) * PERCENTAGE_SPLIT)
    random_list = random.sample(df['purchaseid'].unique().tolist(), amount)
    df_ = df.loc[df['purchaseid'].isin(random_list)]
    df_reduced = df.loc[~df['purchaseid'].isin(random_list)]
    return [df_reduced, df_]

def generate_matrix(dataframe, name):
    mat = sp.dok_matrix((dataframe.shape[0], len(dataframe['itemid'].unique())), dtype=np.float32)
    for purchaseid, itemid in zip(dataframe['purchaseid'], dataframe['itemid']):
        mat[purchaseid, itemid] = 1.0  # At this position is the error
    return mat
Call:
dfs = splitter(df)
df_tr = dfs[0].copy(deep=True)
df_val = dfs[1].copy(deep=True)
train_mat_ = generate_matrix(df_tr, 'train')
val_mat_ = generate_matrix(df_val, 'val')
Error:
IndexError: column index (12) out of range
Dataframe:
#df
purchaseid itemid
0 0 3
1 0 8
2 0 2
3 1 10
4 2 3
5 2 10
6 3 4
7 3 12
8 3 3
9 4 12
10 4 3
11 4 4
12 5 8
13 5 6
14 5 3
15 6 0
16 6 5
17 6 12
18 7 9
19 7 9
# df_tr
purchaseid itemid
0 0 3
1 0 8
2 0 2
3 1 10
4 2 3
5 2 10
6 3 4
7 3 12
8 3 3
9 4 12
10 4 3
11 4 4
12 5 8
13 5 6
14 5 3
18 7 9
19 7 9
20 8 13
# df_val
purchaseid itemid
15 6 0
16 6 5
17 6 12
21 9 1
22 9 7
23 9 11
24 9 11
Try this instead: sp.dok_matrix needs the dimensions of the target matrix. Looking at your data, I have assumed purchaseid ranges over [0, max(purchaseid)] and itemid over [0, max(itemid)].
def generate_matrix(dataframe, name):
    mat = sp.dok_matrix((dataframe['purchaseid'].max() + 1, dataframe['itemid'].max() + 1), dtype=np.float32)
    for purchaseid, itemid in zip(dataframe['purchaseid'], dataframe['itemid']):
        mat[purchaseid, itemid] = 1.0
    return mat
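A self-contained sketch of the sizing fix on toy data (requires scipy): the matrix must be at least (max purchaseid + 1) by (max itemid + 1); sizing by row count or by the number of unique item ids lets an id equal to the dimension trigger the out-of-range error.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

df = pd.DataFrame({'purchaseid': [0, 0, 1, 3],
                   'itemid':     [3, 8, 2, 12]})
# Size by the maximum ids, not by the number of rows or unique values.
mat = sp.dok_matrix((df['purchaseid'].max() + 1,
                     df['itemid'].max() + 1), dtype=np.float32)
for p, i in zip(df['purchaseid'], df['itemid']):
    mat[p, i] = 1.0
```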

Python-way time series transform

Good day!
There is the following time series dataset:
Time Value
1 1
2 1
3 1
4 2
5 2
6 2
7 2
8 3
9 3
10 4
11 4
12 5
I need to split and group data by value like this:
Value  Time start  Time end
1 1 3
2 4 7
3 8 9
4 10 11
5 12 12
How can I do this quickly and in the most functional-programming style in Python? Various libraries can be used, for example pandas or numpy.
Try with pandas:
df.groupby('Value')['Time'].agg(['min','max'])
We can use pandas for this:
Solution:
data = {'Time': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
'Value': [1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 5]
}
df = pd.DataFrame(data, columns= ['Time', 'Value'])
res = df.groupby('Value').agg(['min', 'max'])
f_res = res.rename(columns = {'min': 'Start Time', 'max': 'End Time'}, inplace = False)
print(f_res)
Output:
Time
Start Time End Time
Value
1 1 3
2 4 7
3 8 9
4 10 11
5 12 12
first get the count of Values
result = df.groupby('Value').agg(['count'])
result.columns = result.columns.get_level_values(1) #drop multi-index
result
count
Value
1 3
2 4
3 2
4 2
5 1
then cumcount to get time start
s = df.groupby('Value').cumcount()
result["time start"] = s[s == 0].index.tolist()
result
count time start
Value
1 3 0
2 4 3
3 2 7
4 2 9
5 1 11
finally,
result["time start"] += 1
result["time end"] = result["time start"] + result['count'] - 1
result
count time start time end
Value
1 3 1 3
2 4 4 7
3 2 8 9
4 2 10 11
5 1 12 12
