Python - cumulative sum until sum matches an exact number

I have a column of numbers in a Python Pandas df: 1,8,4,3,1,5,1,4,2
If I create a cumulative sum column, it returns the running total over every row. How do I return only the rows that build up to a cumulative sum of 20, skipping any number that would take the cumulative sum over 20?
+-----+-------+------+
| Var | total | cumu |
+-----+-------+------+
| a | 1 | 1 |
| b | 8 | 9 |
| c | 4 | 13 |
| d | 3 | 16 |
| e | 1 | 17 |
| f | 5 | 22 |
| g | 1 | 23 |
| h | 4 | 27 |
| i | 2 | 29 |
+-----+-------+------+
Desired output:
+-----+-------+------+
| Var | total | cumu |
+-----+-------+------+
| a | 1 | 1 |
| b | 8 | 9 |
| c | 4 | 13 |
| d | 3 | 16 |
| e | 1 | 17 |
| g | 1 | 18 |
| i | 2 | 20 |
+-----+-------+------+

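If I understood your question correctly, you want to skip only the values that would push the cumulative sum over 20:
def acc(total):
    # Greedily collect the index labels whose values still fit under 20.
    s, rv = 0, []
    for v, t in zip(total.index, total):
        if s + t <= 20:
            s += t
            rv.append(v)
    return rv
# .copy() avoids a SettingWithCopyWarning on the assignment below
df = df[df.index.isin(acc(df.total))].copy()
df['cumu'] = df.total.cumsum()
print(df)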
Prints:
Var total cumu
0 a 1 1
1 b 8 9
2 c 4 13
3 d 3 16
4 e 1 17
6 g 1 18
8 i 2 20
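For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same greedy pass. The function name `keep_within_limit` and the `limit` parameter are made up for illustration. Note that a greedy pass only guarantees the total never exceeds the limit. It happens to land on exactly 20 for this data, but it is not a general subset-sum solver.

```python
import pandas as pd

def keep_within_limit(values, limit):
    """Return index labels of the values that fit, in order, under `limit`."""
    running, kept = 0, []
    for label, value in values.items():
        # Skip any value that would push the running total past the limit.
        if running + value <= limit:
            running += value
            kept.append(label)
    return kept

df = pd.DataFrame({
    "Var": list("abcdefghi"),
    "total": [1, 8, 4, 3, 1, 5, 1, 4, 2],
})

result = df.loc[keep_within_limit(df["total"], limit=20)].copy()
result["cumu"] = result["total"].cumsum()
print(result)
```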

Related

How to build sequence of purchases for each ID?

I want to create a dataframe that shows the sequence of what each user purchased, ordered by the sequence column. For example, this is my current df:
user_id | sequence | product | price
1 | 1 | A | 10
1 | 2 | C | 15
1 | 3 | G | 1
2 | 1 | B | 20
2 | 2 | T | 45
2 | 3 | A | 10
...
I want to convert it to the following format:
user_id | source_product | target_product | cum_total_price
1 | A | C | 25
1 | C | G | 16
2 | B | T | 65
2 | T | A | 75
...
How can I achieve this?
shift + cumsum + groupby.apply:
def seq(g):
    g['source_product'] = g['product']
    g['target_product'] = g['product'].shift(-1)
    g['price'] = g.price.cumsum().shift(-1)
    return g[['user_id', 'source_product', 'target_product', 'price']].iloc[:-1]
df.sort_values('sequence').groupby('user_id', group_keys=False).apply(seq)
# user_id source_product target_product price
#0 1 A C 25.0
#1 1 C G 26.0
#3 2 B T 65.0
#4 2 T A 75.0
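If you would rather avoid groupby.apply, the same idea can be sketched with per-group shift and cumsum only. This is an illustration under the answer's interpretation (a running total per user), so the second row for user 1 comes out as 26 rather than the 16 shown in the question. The name `pairs` is just a placeholder.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "sequence": [1, 2, 3, 1, 2, 3],
    "product": ["A", "C", "G", "B", "T", "A"],
    "price": [10, 15, 1, 20, 45, 10],
})

df = df.sort_values(["user_id", "sequence"])
by_user = df.groupby("user_id")

pairs = pd.DataFrame({
    "user_id": df["user_id"],
    "source_product": df["product"],
    # Next product bought by the same user (NaN on the user's last row).
    "target_product": by_user["product"].shift(-1),
    # Running total up to and including the target product.
    "cum_total_price": by_user["price"].cumsum().groupby(df["user_id"]).shift(-1),
})

# The last purchase of each user has no target, so drop it.
pairs = pairs.dropna(subset=["target_product"]).astype({"cum_total_price": int})
print(pairs)
```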

How to count the occurrence of a value and set that count as a new value for that value's row

Title is probably confusing, but let me make it clearer.
Let's say I have a df like this:
+----+------+---------------+
| Id | Name | reports_to_id |
+----+------+---------------+
| 0 | A | 10 |
| 1 | B | 10 |
| 2 | C | 11 |
| 3 | D | 12 |
| 4 | E | 11 |
| 10 | F | 20 |
| 11 | G | 21 |
| 12 | H | 22 |
+----+------+---------------+
I would want my resulting df to look like this:
+----+------+---------------+-------+
| Id | Name | reports_to_id | Count |
+----+------+---------------+-------+
| 0 | A | 10 | 0 |
| 1 | B | 10 | 0 |
| 2 | C | 11 | 0 |
| 3 | D | 12 | 0 |
| 4 | E | 11 | 0 |
| 10 | F | 20 | 2 |
| 11 | G | 21 | 2 |
| 12 | H | 22 | 1 |
+----+------+---------------+-------+
But this is what I currently get from my code (which is wrong):
+----+------+---------------+-------+
| Id | Name | reports_to_id | Count |
+----+------+---------------+-------+
| 0 | A | 10 | 2 |
| 1 | B | 10 | 2 |
| 2 | C | 11 | 2 |
| 3 | D | 12 | 1 |
| 4 | E | 11 | 2 |
| 10 | F | 20 | 0 |
| 11 | G | 21 | 0 |
| 12 | H | 22 | 0 |
+----+------+---------------+-------+
with this code:
df['COUNT'] = df.groupby(['reports_to_id'])['id'].transform('count')
Any suggestions or directions on how to get the result I want? All help is appreciated! and thank you in advance!
Use value_counts to count the reports_to_id by values, then map that to Id:
df['COUNT'] = df['Id'].map(df['reports_to_id'].value_counts()).fillna(0)
Output:
Id Name reports_to_id COUNT
0 0 A 10 0.0
1 1 B 10 0.0
2 2 C 11 0.0
3 3 D 12 0.0
4 4 E 11 0.0
5 10 F 20 2.0
6 11 G 21 2.0
7 12 H 22 1.0
Similar idea with reindex:
df['COUNT'] = df['reports_to_id'].value_counts().reindex(df['Id'], fill_value=0).values
which gives a better looking COUNT:
Id Name reports_to_id COUNT
0 0 A 10 0
1 1 B 10 0
2 2 C 11 0
3 3 D 12 0
4 4 E 11 0
5 10 F 20 2
6 11 G 21 2
7 12 H 22 1
You can try the following:
l = list(df['reports_to_id'])
df['Count'] = df['Id'].apply(lambda x: l.count(x))
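A small runnable sketch of the value_counts + map approach on the sample data, with an extra cast to int so the column does not end up as floats. The variable name `report_counts` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Id": [0, 1, 2, 3, 4, 10, 11, 12],
    "Name": list("ABCDEFGH"),
    "reports_to_id": [10, 10, 11, 12, 11, 20, 21, 22],
})

# How many rows report to each id.
report_counts = df["reports_to_id"].value_counts()

# Look up each row's own Id; ids nobody reports to become NaN, then 0.
df["Count"] = df["Id"].map(report_counts).fillna(0).astype(int)
print(df)
```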

Find top N values within each group

I have a dataset similar to the sample below:
| id | size | old_a | old_b | new_a | new_b |
|----|--------|-------|-------|-------|-------|
| 6 | small | 3 | 0 | 21 | 0 |
| 6 | small | 9 | 0 | 23 | 0 |
| 13 | medium | 3 | 0 | 12 | 0 |
| 13 | medium | 37 | 0 | 20 | 1 |
| 20 | medium | 30 | 0 | 5 | 6 |
| 20 | medium | 12 | 2 | 3 | 0 |
| 12 | small | 7 | 0 | 2 | 0 |
| 10 | small | 8 | 0 | 12 | 0 |
| 15 | small | 19 | 0 | 3 | 0 |
| 15 | small | 54 | 0 | 8 | 0 |
| 87 | medium | 6 | 0 | 9 | 0 |
| 90 | medium | 11 | 1 | 16 | 0 |
| 90 | medium | 25 | 0 | 4 | 0 |
| 90 | medium | 10 | 0 | 5 | 0 |
| 9 | large | 8 | 1 | 23 | 0 |
| 9 | large | 19 | 0 | 2 | 0 |
| 1 | large | 1 | 0 | 0 | 0 |
| 50 | large | 34 | 0 | 7 | 0 |
This is the input for above table:
data=[[6,'small',3,0,21,0],[6,'small',9,0,23,0],[13,'medium',3,0,12,0],[13,'medium',37,0,20,1],[20,'medium',30,0,5,6],[20,'medium',12,2,3,0],[12,'small',7,0,2,0],[10,'small',8,0,12,0],[15,'small',19,0,3,0],[15,'small',54,0,8,0],[87,'medium',6,0,9,0],[90,'medium',11,1,16,0],[90,'medium',25,0,4,0],[90,'medium',10,0,5,0],[9,'large',8,1,23,0],[9,'large',19,0,2,0],[1,'large',1,0,0,0],[50,'large',34,0,7,0]]
data= pd.DataFrame(data,columns=['id','size','old_a','old_b','new_a','new_b'])
I want an output that groups the dataset on size and lists the top 2 ids within each size group, ranked by the 'new_a' column. Since some of the ids repeat multiple times, I want to sum the values of new_a for each id first and then find the top 2 values. My final table should look like the one below:
| size | id | new_a |
|--------|----|-------|
| large | 9 | 25 |
| large | 50 | 7 |
| medium | 13 | 32 |
| medium | 90 | 25 |
| small | 6 | 44 |
| small | 10 | 12 |
I have tried the below code but it isn't showing top 2 values of new_a for each group within 'size' column.
nlargest = data.groupby(['size','id'])['new_a'].sum().nlargest(2).reset_index()
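You can group by size, then sum new_a per id and take the two largest within each group:
print(
    # df is the DataFrame the question calls data
    df.groupby('size').apply(
        # select new_a before summing so the string column 'size' is not aggregated
        lambda x: x.groupby('id')[['new_a']].sum().nlargest(2, columns='new_a')
    ).reset_index()[['size', 'id', 'new_a']]
)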
Prints:
size id new_a
0 large 9 25
1 large 50 7
2 medium 13 32
3 medium 90 25
4 small 6 44
5 small 10 12
You can set size and id as the index and work on index levels instead of columns. The level parameter of Series.sum was removed in pandas 2.0, so the per-id sum is written as a level-based groupby:
df.set_index(["size", "id"])["new_a"].groupby(level=0).apply(
    lambda x: x.groupby(level=1).sum().nlargest(2)
).reset_index()
size id new_a
0 large 9 25
1 large 50 7
2 medium 13 32
3 medium 90 25
4 small 6 44
5 small 10 12
You can chain two groupby methods:
data.groupby(['id', 'size'])['new_a'].sum().groupby('size').nlargest(2)\
.droplevel(0).to_frame('new_a').reset_index()
Output:
id size new_a
0 9 large 25
1 50 large 7
2 13 medium 32
3 90 medium 25
4 6 small 44
5 10 small 12
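Another way to look at it, sketched here without apply: sum per (size, id), sort so the largest totals come first within each size, and keep the first two rows of every group. Only the columns that matter for the result are reproduced from the question's data, and the names `totals` and `top2` are placeholders.

```python
import pandas as pd

data = pd.DataFrame(
    [[6, "small", 21], [6, "small", 23], [13, "medium", 12], [13, "medium", 20],
     [20, "medium", 5], [20, "medium", 3], [12, "small", 2], [10, "small", 12],
     [15, "small", 3], [15, "small", 8], [87, "medium", 9], [90, "medium", 16],
     [90, "medium", 4], [90, "medium", 5], [9, "large", 23], [9, "large", 2],
     [1, "large", 0], [50, "large", 7]],
    columns=["id", "size", "new_a"],
)

# Step 1: one row per (size, id) with the summed new_a.
totals = data.groupby(["size", "id"], as_index=False)["new_a"].sum()

# Step 2: largest totals first within each size, then keep two per group.
top2 = (
    totals.sort_values(["size", "new_a"], ascending=[True, False])
    .groupby("size")
    .head(2)
    .reset_index(drop=True)
)
print(top2)
```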

Select a row based on column value and its previous 2 rows

+---+---+---+---+----+
| A | B | C | D | E |
+---+---+---+---+----+
| 1 | 2 | 3 | 4 | VK |
| 1 | 4 | 6 | 9 | MD |
| 2 | 5 | 7 | 9 | V |
| 2 | 3 | 5 | 8 | VK |
| 2 | 3 | 7 | 9 | V |
| 1 | 1 | 1 | 1 | N |
| 0 | 1 | 6 | 9 | V |
| 1 | 2 | 5 | 7 | VK |
| 1 | 7 | 8 | 0 | MD |
| 1 | 5 | 7 | 9 | VK |
| 0 | 1 | 6 | 8 | V |
+---+---+---+---+----+
I want to select rows based on a column value, together with the two rows preceding each match. For example, in the dataset above I want to select every row where column 'E' equals 'VK', plus the two rows before each of those rows. So we should get a dataset like this:
+---+---+---+---+----+
| A | B | C | D | E |
+---+---+---+---+----+
| 1 | 2 | 3 | 4 | VK |
| 1 | 4 | 6 | 9 | MD |
| 2 | 5 | 7 | 9 | V |
| 2 | 3 | 5 | 8 | VK |
| 2 | 3 | 7 | 9 | V |
| 1 | 1 | 1 | 1 | N |
| 1 | 2 | 5 | 7 | VK |
| 1 | 7 | 8 | 0 | MD |
| 1 | 5 | 7 | 9 | VK |
+---+---+---+---+----+
First, filter the dataframe up to the last VK row. Then build a group key with cumsum on the reversed frame, so that each VK starts a new group, and take the first 3 rows of each group with groupby + head:
df=df.loc[:df.E.eq('VK').loc[lambda x : x].index.max()]
df=df.iloc[::-1].groupby(df.E.eq('VK').iloc[::-1].cumsum()).head(3).sort_index()
print(df)
Prints:
A B C D E
0 1 2 3 4 VK
1 1 4 6 9 MD
2 2 5 7 9 V
3 2 3 5 8 VK
5 1 1 1 1 N
6 0 1 6 9 V
7 1 2 5 7 VK
8 1 7 8 0 MD
9 1 5 7 9 VK
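The same selection can also be sketched with a shifted boolean mask, which some may find easier to read: a row is kept if it is a VK row, or if a VK row sits one or two positions below it. The names `is_vk`, `keep` and `selected` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 2, 2, 2, 1, 0, 1, 1, 1, 0],
    "B": [2, 4, 5, 3, 3, 1, 1, 2, 7, 5, 1],
    "C": [3, 6, 7, 5, 7, 1, 6, 5, 8, 7, 6],
    "D": [4, 9, 9, 8, 9, 1, 9, 7, 0, 9, 8],
    "E": ["VK", "MD", "V", "VK", "V", "N", "V", "VK", "MD", "VK", "V"],
})

is_vk = df["E"].eq("VK")

# shift(-k) pulls the flag of the row k positions below up to the current row.
keep = (
    is_vk
    | is_vk.shift(-1, fill_value=False)
    | is_vk.shift(-2, fill_value=False)
)

selected = df[keep]
print(selected)
```

This reproduces the row labels of the answer above (0, 1, 2, 3, 5, 6, 7, 8, 9) and generalizes to N previous rows by OR-ing more shifted masks.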

Use the other columns value if a condition is met Panda

Assuming I have the following table:
+----+---+---+
| A | B | C |
+----+---+---+
| 1 | 1 | 3 |
| 2 | 2 | 7 |
| 6 | 3 | 2 |
| -1 | 9 | 0 |
| 2 | 1 | 3 |
| -8 | 8 | 2 |
| 2 | 1 | 9 |
+----+---+---+
If column A's value is negative, replace column B's value with the value of column C. Otherwise, do nothing.
This is the desired output:
+----+---+---+
| A | B | C |
+----+---+---+
| 1 | 1 | 3 |
| 2 | 2 | 7 |
| 6 | 3 | 2 |
| -1 | 0 | 0 |
| 2 | 1 | 3 |
| -8 | 2 | 2 |
| 2 | 1 | 9 |
+----+---+---+
I've been trying the following code but it's not working
#not working
result.loc(result["A"] < 0,result['B'] = result['C'].iloc[0])
result.B[result.A < 0] = result.C
Try this:
df.loc[df['A'] < 0, 'B'] = df['C']
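This works because the assignment through .loc aligns df['C'] on the index, so only the rows selected by the mask are overwritten. A runnable sketch on the sample table, with Series.mask shown as an equivalent that returns a new column instead of modifying in place (the names `negative` and `alt_b` are just for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 6, -1, 2, -8, 2],
    "B": [1, 2, 3, 9, 1, 8, 1],
    "C": [3, 7, 2, 0, 3, 2, 9],
})

negative = df["A"] < 0

# Equivalent without in-place modification: take C wherever the mask is True.
alt_b = df["B"].mask(negative, df["C"])

# In-place version from the answer: only the masked rows of B are replaced.
df.loc[negative, "B"] = df["C"]
print(df)
```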
