I have data that looks like this:
d = {'id' : [1, 1, 1, 2, 2, 2],
'levels': ['low', 'perfect', 'high', 'low', 'perfect', 'high'],
'value': [1, 10, 13, 2, 10, 13]}
df = pd.DataFrame(d, columns=['id', 'levels', 'value'])
df = df.groupby(['id','levels'])[['value']].mean()
For each [id, levels], I want to find the difference between the value of the row and the value of the perfect row. It would look like this:
id | levels | value | penalty
1 | high | 13 | 3
| low | 1 | 9
| perfect| 10 | 0
2 | high | 13 | 3
| low | 2 | 8
| perfect| 10 | 0
For example, in the first row, you would subtract the perfect value, which is 10, from 13 to get 3.
So how do I make a calculation where I find the perfect value for each [id, levels], and then find the differences?
Select the cross section of the dataframe using xs, then subtract this cross section from the given dataframe, aligning on level=0:
df['penalty'] = df['value'].sub(df['value'].xs('perfect', level=1)).abs()
value penalty
id levels
1 high 13 3
low 1 9
perfect 10 0
2 high 13 3
low 2 8
perfect 10 0
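Since df holds the grouped means, the cross section being subtracted is just the perfect value per id; a minimal sketch of that intermediate result:
df['value'].xs('perfect', level=1)
id
1    10.0
2    10.0
Name: value, dtype: float64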
You can try transform to locate each group's perfect row, then subtract and take the absolute value:
val = df.loc[df['levels'].eq('perfect').groupby(df['id']).transform('idxmax'),'value']
df['penalty'] = df['value'].sub(val.to_numpy()).abs()
print(df)
id levels value penalty
0 1 low 1 9
1 1 perfect 10 0
2 1 high 13 3
3 2 low 2 8
4 2 perfect 10 0
5 2 high 13 3
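Note that this answer works on the original long-format frame from the question (before the groupby mean), not on the indexed frame used above. The intermediate val is simply each group's perfect value repeated once per row; a minimal sketch, rebuilding the frame from the question:
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2],
                   'levels': ['low', 'perfect', 'high', 'low', 'perfect', 'high'],
                   'value': [1, 10, 13, 2, 10, 13]})

# index of each group's 'perfect' row, broadcast to every row of that group
idx = df['levels'].eq('perfect').groupby(df['id']).transform('idxmax')
val = df.loc[idx, 'value']
print(val.to_numpy())  # [10 10 10 10 10 10]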
I have the following dataframe:
import pandas as pd
data = {'id': [542588, 542594, 542594, 542605, 542605, 542605, 542630, 542630],
'label': [3, 3, 1, 1, 2, 0, 0, 2]}
df = pd.DataFrame(data)
df
id label
0 542588 3
1 542594 3
2 542594 1
3 542605 1
4 542605 2
5 542605 0
6 542630 0
7 542630 2
The id column contains large 6-digit integers. I want a way to simplify it, starting from 10, so that 542588 becomes 10, 542594 becomes 11, and so on.
Required output:
id label
0 10 3
1 11 3
2 11 1
3 12 1
4 12 2
5 12 0
6 13 0
7 13 2
You can use factorize:
df['id'] = df['id'].factorize()[0] + 10
Output:
id label
0 10 3
1 11 3
2 11 1
3 12 1
4 12 2
5 12 0
6 13 0
7 13 2
Note: factorize will enumerate the keys in the order in which they occur in your data, while the groupby().ngroup() solution enumerates them in increasing order. You can mimic the increasing order with factorize by sorting the data first (or by passing sort=True to factorize), or replicate the data order with groupby() by passing sort=False to it.
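To illustrate the ordering note on a small shuffled example (not the question's data, where the ids already happen to appear in increasing order):
import pandas as pd

s = pd.Series([542594, 542588, 542594, 542605])

print(pd.factorize(s)[0] + 10)             # appearance order: [10 11 10 12]
print(pd.factorize(s, sort=True)[0] + 10)  # increasing order: [11 10 11 12]
print(s.groupby(s, sort=False).ngroup().add(10).to_numpy())  # appearance order: [10 11 10 12]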
You can try
df['id'] = df.groupby('id').ngroup().add(10)
print(df)
id label
0 10 3
1 11 3
2 11 1
3 12 1
4 12 2
5 12 0
6 13 0
7 13 2
This is a naive way: loop through the IDs, and every time you encounter an ID you haven't seen before, associate it in a dictionary with a new ID (starting at 10 and incrementing by 1 each time).
You can then swap out the values of the ID column using the map method.
new_ids = dict()
new_id = 10
for old_id in df['id']:
    if old_id not in new_ids:
        new_ids[old_id] = new_id
        new_id += 1
df['id'] = df['id'].map(new_ids)
If I have a python datatable like this:
from datatable import f, dt
data = dt.Frame(grp=["a","a","b","b","b","b","c"], value=[2,3,1,2,5,9,2])
how do I create a new column that has the row number, by group? That is, what is the equivalent of R data.table's
data[, id:=1:.N, by=.(grp)]
This works, but seems completely ridiculous:
import numpy as np

data['id'] = np.concatenate(
    [np.arange(x)
     for x in data[:, dt.count(), dt.by(f.grp)]['count'].to_numpy()])
desired output:
| grp value id
| str32 int32 int64
-- + ----- ----- -----
0 | a 2 0
1 | a 3 1
2 | b 1 0
3 | b 2 1
4 | b 5 2
5 | b 9 3
6 | c 2 0
Update:
Datatable now has a cumcount function in dev:
data[:, [f.value, dt.cumcount()], 'grp']
| grp value C0
| str32 int32 int64
-- + ----- ----- -----
0 | a 2 0
1 | a 3 1
2 | b 1 0
3 | b 2 1
4 | b 5 2
5 | b 9 3
6 | c 2 0
[7 rows x 3 columns]
Old Answer:
datatable does not have a cumulative count function; in fact, there is no cumulative function for any aggregation at the moment.
One way to improve the speed is to push the iteration into numpy, where the for loop runs in C and is more efficient. The code is adapted from here for this purpose:
from datatable import dt, f, by
import numpy as np
def create_ranges(indices):
    cum_length = indices.cumsum()
    ids = np.ones(cum_length[-1], dtype=int)
    ids[0] = 0
    ids[cum_length[:-1]] = -1 * indices[:-1] + 1
    return ids.cumsum()
counts = data[:, dt.count(), by('grp', add_columns=False)].to_numpy().ravel()
data[:, f[:].extend({"counts" : create_ranges(counts)})]
| grp value counts
| str32 int32 int64
-- + ----- ----- ------
0 | a 2 0
1 | a 3 1
2 | b 1 0
3 | b 2 1
4 | b 5 2
5 | b 9 3
6 | c 2 0
[7 rows x 3 columns]
The create_ranges function is wonderful (the logic built on cumsum is nice) and really kicks in as the array size increases.
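To make the cumsum trick concrete, here is what create_ranges produces for a small counts array (an illustration added here, not part of the original answer):
import numpy as np

counts = np.array([2, 4, 1])   # group sizes, as returned by dt.count() per group
print(create_ranges(counts))   # [0 1 0 1 2 3 0]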
Of course this has its drawbacks: you step out of datatable into numpy territory and back again, and it relies on the groups being sorted lexically; it won't work if the data is unsorted (it would first have to be sorted on the grouping column).
Preliminary tests show a marked improvement in speed; again it is limited in scope and it would be much easier/better if this was baked into the datatable library.
If you are good with C++, you could consider contributing this function to the library; I and so many others would appreciate your effort.
You could have a look at pypolars and see if it helps with your use case. From the h2o benchmarks it looks like a very fast tool.
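The project has since been renamed to polars; assuming a recent release, the grouped row number might look like this (a rough sketch, not benchmarked):
import polars as pl

df = pl.DataFrame({"grp": ["a", "a", "b", "b", "b", "b", "c"],
                   "value": [2, 3, 1, 2, 5, 9, 2]})

# 0-based row number within each grp group, computed as a window expression
print(df.with_columns(pl.int_range(0, pl.len()).over("grp").alias("id")))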
One approach is to convert with to_pandas, group on the pandas DataFrame, and use cumcount:
import datatable as dt
data = dt.Frame(grp=["a", "a", "b", "b", "b", "b", "c"], value=[2, 3, 1, 2, 5, 9, 2])
data["id"] = data.to_pandas().groupby("grp").cumcount()
print(data)
Output
| grp value id
| str32 int32 int64
-- + ----- ----- -----
0 | a 2 0
1 | a 3 1
2 | b 1 0
3 | b 2 1
4 | b 5 2
5 | b 9 3
6 | c 2 0
[7 rows x 3 columns]
I have a sales dataframe with customer data and sales teams. I have a target number of calls for each individual customer, which I want to split across the sales teams:
cust_id| total_calls_req| group_1_rep| group_2_rep| group_3_rep
34523 | 10 | 230429 | nan | 583985
34583 | 12 | 230429 | 539409 | 583985
34455 | 6 | 135552 | nan | nan
I want to create a function that splits the total_calls_req across each group based on whether or not there is a group_rep assigned.
If cust_id is assigned to 1 rep then the total_calls_req is all assigned to the rep in question
If cust_id is assigned to 2 reps then the total_calls_req is split between the two reps in question.
If cust_id is assigned to 3 reps then the total_calls_req is split randomly between the three reps in question, and the splits need to be whole numbers of calls.
I want the end dataframe to look like this:
cust_id| total_calls_req| group_1_rep| group_2_rep| group_3_rep| group_1_rep_calls| group_2_rep_calls| group_3_rep_calls
34523 | 10 | 230429 | nan | 583985 | 5 | 0 | 5
34583 | 12 | 230429 | 539409 | 583985 | 6 | 3 | 3
34455 | 6 | 135552 | nan | nan | 6 | 0 | 0
Is there a way I can do that through a python function?
You can build a function which returns a Series with three elements according to the number of NaN values. I based this on this answer to get the Series; that answer uses numpy.random.multinomial.
import numpy as np
import pandas as pd

def serie_split(row):
    total_calls_req = row[0]
    groups = row[1:]
    num_assigned = pd.notna(groups).sum()  # number of reps actually assigned (non-NaN)
    if num_assigned == len(groups):
        # every rep is assigned: random whole-number split across all of them
        s = pd.Series(np.random.multinomial(total_calls_req, [1/len(groups)] * len(groups)))
    else:
        s = pd.Series(groups)
        # even split when the total divides cleanly, otherwise a random whole-number split
        s.loc[s.notna()] = (total_calls_req / num_assigned
                            if total_calls_req % num_assigned == 0
                            else np.random.multinomial(total_calls_req, [1/num_assigned] * num_assigned))
    s.loc[s.isna()] = 0
    return s
def get_rep_calls(df):
    columns = df.filter(like='group_').add_suffix('_calls').columns
    # dfg holds only 'total_calls_req', 'group_1_rep', 'group_2_rep' and 'group_3_rep'
    dfg = df[df.columns[1:]]
    series = [serie_split(row) for row in dfg.to_numpy(dtype='object')]
    for index in range(len(dfg)):
        df.loc[index, columns] = series[index].values

get_rep_calls(df)
print(df)
Output (I have added an example in the last row with total_calls_req = 13):
cust_id| total_calls_req| group_1_rep| group_2_rep| group_3_rep| group_1_rep_calls| group_2_rep_calls| group_3_rep_calls
34523  | 10             | 230429     | NaN        | 583985     | 5                | 0                | 5
34583  | 12             | 230429     | 539409     | 583985     | 4                | 7                | 1
34455  | 6              | 135552     | NaN        | NaN        | 6                | 0                | 0
12345  | 13             | 123456     | NaN        | 583985     | 10               | 0                | 3
You can use this custom split function to split the calls among the reps that are assigned to the customer. It uses the columns starting with "group_" to identify the assigned reps and count them. When there are more than two, numpy.random.multinomial generates a random split.
import numpy as np

def split(s):
    # 1 for each assigned rep column, 0 where the rep is NaN
    reps = (~s.filter(like='group_').isna()).astype(int).add_suffix('_calls')
    total = reps.sum()
    if total > 2:  # remove this line and the next to use the even split below for any number of reps
        return np.random.multinomial(s['total_calls_req'], [1/total]*total)
    div, mod = divmod(int(s['total_calls_req']), total)
    reps = reps*div  # split evenly
    reps.iloc[np.random.choice(np.flatnonzero(reps), mod)] += 1  # allocate remainder randomly
    return reps
pd.concat([df, df.apply(split, axis=1)], axis=1)
output:
cust_id total_calls_req group_1_rep group_2_rep group_3_rep group_1_rep_calls group_2_rep_calls group_3_rep_calls
0 34523 10 230429 NaN 583985.0 5 0 5
1 34583 12 230429 539409.0 583985.0 4 2 6
2 34455 6 135552 NaN NaN 6 0 0
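For reference, numpy.random.multinomial is what produces the random whole-number split in both answers; a minimal illustration (the exact output varies between runs but always sums to the total):
import numpy as np

print(np.random.multinomial(12, [1/3] * 3))  # e.g. [5 3 4]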
I have the following grid with bins defined by x and y, and each grid square given a unique id
mapping = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9], 'x': [1,1,1,2,2,2,3,3,3], 'y': [1,2,3,1,2,3,1,2,3]})
id x y
0 1 1 1
1 2 1 2
2 3 1 3
3 4 2 1
4 5 2 2
5 6 2 3
6 7 3 1
7 8 3 2
8 9 3 3
I also have a new dataframe of observations for which I would like to know their associated id (i.e. which grid square they fall into):
coordinates = pd.DataFrame({'x': [1.4, 2.7], 'y': [1.9, 1.1]})
x y
0 1.4 1.9
1 2.7 1.1
My solution is the following function:
import bisect

def get_id(coords, mapping):
    x_val = mapping.x[bisect.bisect_right(mapping.x, coords[0]) - 1]
    y_val = mapping.y[bisect.bisect_right(mapping.y, coords[1]) - 1]
    id = mapping[(mapping.x == x_val) & (mapping.y == y_val)].iloc[0, 0]
    return id

coordinates.apply(get_id, mapping=mapping, axis=1)
Out[21]:
0 1
1 4
dtype: int64
This works but becomes slow as the coordinates dataframe grows; I need it to handle 10^6+ observations. Is there a faster way to do this?
Edit:
To answer @abdurrehman245's question from the comments below:
My current method is to simply round down each data point; this lets me map it to an id using the mapping dataframe, which holds the minimum entries (the bin lower bounds) for each id. So x=1.4, y=1.9 rounds down to x=1, y=1, which is mapped to id=1 according to the mapping.
Maybe this cartesian visualisation makes this a little bit more clear:
Y
4 -------------------------
| 3 | 6 | 9 |
| | | |
3 -------------------------
| 2 | 5 | 8 |
| | | |
2 -------------------------
| 1 | 4 | 7 |
| | | |
1 ------------------------- X
1 2 3 4
I would also add that I could not use the floor function as the bins are not necessarily nice integers as in this example.
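Not from the original thread, but one vectorised way to express the same round-down-to-bin idea is numpy.searchsorted, which is essentially a vectorised bisect_right; a sketch under the assumption that mapping is sorted by x and y as in the example:
import numpy as np
import pandas as pd

mapping = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9],
                        'x': [1,1,1,2,2,2,3,3,3],
                        'y': [1,2,3,1,2,3,1,2,3]})
coordinates = pd.DataFrame({'x': [1.4, 2.7], 'y': [1.9, 1.1]})

# the distinct lower bounds of the grid squares along each axis
x_bins = np.sort(mapping['x'].unique())
y_bins = np.sort(mapping['y'].unique())

# vectorised bisect_right: index of the bin each observation falls into
x_idx = np.searchsorted(x_bins, coordinates['x'], side='right') - 1
y_idx = np.searchsorted(y_bins, coordinates['y'], side='right') - 1

# map each (x, y) lower-bound pair back to its id via a MultiIndex lookup
lookup = mapping.set_index(['x', 'y'])['id']
ids = lookup.loc[list(zip(x_bins[x_idx], y_bins[y_idx]))].to_numpy()
print(ids)  # [1 4]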
I have a dataframe where I have transformed all NaN to 0 for a specific reason. In doing another calculation on the df, my group by is picking up a 0 and making it a value to perform the counts on. Any idea how to get python and pandas to exclude the 0 value? In this case the 0 represents a single row in the data. Is there a way to exclude all 0's from the groupby?
My groupby looks like this
+----------------+----------------+-------------+
| Team | Method | Count |
+----------------+----------------+-------------+
| Team 1 | Automated | 1 |
| Team 1 | Manual | 14 |
| Team 2 | Automated | 5 |
| Team 2 | Hybrid | 1 |
| Team 2 | Manual | 25 |
| Team 4 | 0 | 1 |
| Team 4 | Automated | 1 |
| Team 4 | Hybrid | 13 |
+----------------+----------------+-------------+
My code looks like this (after importing excel file)
df = df1.fillna(0)
a = df[['Team', 'Method']]
b = a.groupby(['Team', 'Method']).agg({'Method' : 'count'})
I'd filter the df prior to grouping:
In [8]:
a = df.loc[df['Method'] !=0, ['Team', 'Method']]
b = a.groupby(['Team', 'Method']).agg({'Method' : 'count'})
b
Out[8]:
Method
Team Method
1 Automated 1
Manual 1
2 Automated 1
Hybrid 1
Manual 1
4 Automated 1
Hybrid 1
Here we only select rows where Method is not equal to 0.
Compare against the result without filtering:
In [9]:
a = df[['Team', 'Method']]
b = a.groupby(['Team', 'Method']).agg({'Method' : 'count'})
b
Out[9]:
Method
Team Method
1 Automated 1
Manual 1
2 Automated 1
Hybrid 1
Manual 1
4 0 1
Automated 1
Hybrid 1
You need the filter.
The filter method returns a subset of the original object. Suppose
we want to take only elements that belong to groups with a group sum
greater than 2.
Example:
In [94]: sf = pd.Series([1, 1, 2, 3, 3, 3])

In [95]: sf.groupby(sf).filter(lambda x: x.sum() > 2)
Out[95]:
3    3
4    3
5    3
dtype: int64
Source.
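Applied to the question's frame, the same filter idea could look like this (a sketch only; column names taken from the question, and the simple boolean filter in the answer above works just as well):
a = df[['Team', 'Method']]
kept = a.groupby(['Team', 'Method']).filter(lambda g: (g['Method'] != 0).all())
b = kept.groupby(['Team', 'Method']).agg({'Method': 'count'})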