Pandas: sum multiple columns based on similar consecutive numbers in another column - python

Given the following table
+----+--------+--------+--------------+
| Nr | Price | Volume | Transactions |
+----+--------+--------+--------------+
| 1 | 194.6 | 100 | 1 |
| 2 | 195 | 10 | 1 |
| 3 | 194.92 | 100 | 1 |
| 4 | 194.92 | 52 | 1 |
| 5 | 194.9 | 99 | 1 |
| 6 | 194.86 | 74 | 1 |
| 7 | 194.85 | 900 | 1 |
| 8 | 194.85 | 25 | 1 |
| 9 | 194.85 | 224 | 1 |
| 10 | 194.6 | 101 | 1 |
| 11 | 194.85 | 19 | 1 |
| 12 | 194.6 | 10 | 1 |
| 13 | 194.6 | 25 | 1 |
| 14 | 194.53 | 12 | 1 |
| 15 | 194.85 | 14 | 1 |
| 16 | 194.6 | 11 | 1 |
| 17 | 194.85 | 93 | 1 |
| 18 | 195 | 90 | 1 |
| 19 | 195 | 100 | 1 |
| 20 | 195 | 50 | 1 |
| 21 | 195 | 50 | 1 |
| 22 | 195 | 25 | 1 |
| 23 | 195 | 5 | 1 |
| 24 | 195 | 500 | 1 |
| 25 | 195 | 100 | 1 |
| 26 | 195.09 | 100 | 1 |
| 27 | 195 | 120 | 1 |
| 28 | 195 | 60 | 1 |
| 29 | 195 | 40 | 1 |
| 30 | 195 | 10 | 1 |
| 31 | 194.6 | 1 | 1 |
| 32 | 194.99 | 1 | 1 |
| 33 | 194.81 | 20 | 1 |
| 34 | 194.81 | 50 | 1 |
| 35 | 194.97 | 17 | 1 |
| 36 | 194.99 | 25 | 1 |
| 37 | 195 | 75 | 1 |
+----+--------+--------+--------------+
For faster testing, here is the same table as a pandas DataFrame (assuming import pandas as pd):
pd_data_before = pd.DataFrame([[1,194.6,100,1],[2,195,10,1],[3,194.92,100,1],[4,194.92,52,1],[5,194.9,99,1],[6,194.86,74,1],[7,194.85,900,1],[8,194.85,25,1],[9,194.85,224,1],[10,194.6,101,1],[11,194.85,19,1],[12,194.6,10,1],[13,194.6,25,1],[14,194.53,12,1],[15,194.85,14,1],[16,194.6,11,1],[17,194.85,93,1],[18,195,90,1],[19,195,100,1],[20,195,50,1],[21,195,50,1],[22,195,25,1],[23,195,5,1],[24,195,500,1],[25,195,100,1],[26,195.09,100,1],[27,195,120,1],[28,195,60,1],[29,195,40,1],[30,195,10,1],[31,194.6,1,1],[32,194.99,1,1],[33,194.81,20,1],[34,194.81,50,1],[35,194.97,17,1],[36,194.99,25,1],[37,195,75,1]],columns=['Nr','Price','Volume','Transactions'])
How can we sum Volume and Transactions over each run of identical consecutive prices, keeping the last Nr of each run? The end result would look like this:
+----+--------+--------+--------------+
| Nr | Price | Volume | Transactions |
+----+--------+--------+--------------+
| 1 | 194.6 | 100 | 1 |
| 2 | 195 | 10 | 1 |
| 4 | 194.92 | 152 | 2 |
| 5 | 194.9 | 99 | 1 |
| 6 | 194.86 | 74 | 1 |
| 9 | 194.85 | 1149 | 3 |
| 10 | 194.6 | 101 | 1 |
| 11 | 194.85 | 19 | 1 |
| 13 | 194.6 | 35 | 2 |
| 14 | 194.53 | 12 | 1 |
| 15 | 194.85 | 14 | 1 |
| 16 | 194.6 | 11 | 1 |
| 17 | 194.85 | 93 | 1 |
| 25 | 195 | 920 | 8 |
| 26 | 195.09 | 100 | 1 |
| 30 | 195 | 230 | 4 |
| 31 | 194.6 | 1 | 1 |
| 32 | 194.99 | 1 | 1 |
| 34 | 194.81 | 70 | 2 |
| 35 | 194.97 | 17 | 1 |
| 36 | 194.99 | 25 | 1 |
| 37 | 195 | 75 | 1 |
+----+--------+--------+--------------+
You can also find the result ready made in a pandas dataframe below:
pd_data_after = pd.DataFrame([[1,194.6,100,1],[2,195,10,1],[4,194.92,152,2],[5,194.9,99,1],[6,194.86,74,1],[9,194.85,1149,3],[10,194.6,101,1],[11,194.85,19,1],[13,194.6,35,2],[14,194.53,12,1],[15,194.85,14,1],[16,194.6,11,1],[17,194.85,93,1],[25,195,920,8],[26,195.09,100,1],[30,195,230,4],[31,194.6,1,1],[32,194.99,1,1],[34,194.81,70,2],[35,194.97,17,1],[36,194.99,25,1],[37,195,75,1]],columns=['Nr','Price','Volume','Transactions'])
I managed to achieve this with a for loop, but iterating row by row is very slow. My data set is huge, around 50 million rows.
Is there any way to achieve this without looping?

A common trick to group by runs of consecutive equal values is the following:
df.col.ne(df.col.shift()).cumsum()
We can use that here as the grouping key (df is the question's pd_data_before), then use agg to keep the last Nr and the first Price of each run and to sum Volume and Transactions.
(df.groupby(df.Price.ne(df.Price.shift()).cumsum())
   .agg({'Nr': 'last', 'Price': 'first', 'Volume': 'sum', 'Transactions': 'sum'})
).reset_index(drop=True)
Nr Price Volume Transactions
0 1 194.60 100 1
1 2 195.00 10 1
2 4 194.92 152 2
3 5 194.90 99 1
4 6 194.86 74 1
5 9 194.85 1149 3
6 10 194.60 101 1
7 11 194.85 19 1
8 13 194.60 35 2
9 14 194.53 12 1
10 15 194.85 14 1
11 16 194.60 11 1
12 17 194.85 93 1
13 25 195.00 920 8
14 26 195.09 100 1
15 30 195.00 230 4
16 31 194.60 1 1
17 32 194.99 1 1
18 34 194.81 70 2
19 35 194.97 17 1
20 36 194.99 25 1
21 37 195.00 75 1
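
To see why the trick works, here is a minimal sketch on a made-up five-row frame (the toy data and the names starts_new_run and run_id are illustrative assumptions, not part of the question). ne(shift()) flags every row whose price differs from the previous row, and cumsum() turns those flags into a run identifier:

```python
import pandas as pd

# Hypothetical toy data: the price 10.0 appears in two separate runs.
toy = pd.DataFrame({
    "Nr": [1, 2, 3, 4, 5],
    "Price": [10.0, 10.0, 11.0, 10.0, 10.0],
    "Volume": [5, 7, 1, 2, 3],
})

# True where the price differs from the previous row (the first row is always True).
starts_new_run = toy["Price"].ne(toy["Price"].shift())

# The running total of the flags gives each consecutive run its own id: 1, 1, 2, 3, 3.
run_id = starts_new_run.cumsum()

result = (
    toy.groupby(run_id)
    .agg({"Nr": "last", "Price": "first", "Volume": "sum"})
    .reset_index(drop=True)
)
print(result)
```

The two runs of 10.0 stay separate because the key is the run id, not the price itself. One caveat worth checking on real data: NaN never compares equal to NaN, so consecutive missing prices would each start their own run.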

Related

Filter rows based on condition in Pandas

I have a dataframe df_groups that contains the sample number, group number and accuracy.
Table 1: Samples with their groups
+----+----------+------------+------------+
| | sample | group | Accuracy |
|----+----------+------------+------------|
| 0 | 0 | 6 | 91.6 |
| 1 | 1 | 4 | 92.9333 |
| 2 | 2 | 2 | 91 |
| 3 | 3 | 2 | 90.0667 |
| 4 | 4 | 4 | 91.8 |
| 5 | 5 | 5 | 92.5667 |
| 6 | 6 | 6 | 91.1 |
| 7 | 7 | 5 | 92.3333 |
| 8 | 8 | 2 | 92.7667 |
| 9 | 9 | 0 | 91.1333 |
| 10 | 10 | 4 | 92.5 |
| 11 | 11 | 5 | 92.4 |
| 12 | 12 | 7 | 93.1333 |
| 13 | 13 | 7 | 93.5333 |
| 14 | 14 | 2 | 92.1 |
| 15 | 15 | 6 | 93.2 |
| 16 | 16 | 8 | 92.7333 |
| 17 | 17 | 8 | 90.8 |
| 18 | 18 | 3 | 91.9 |
| 19 | 19 | 3 | 93.3 |
| 20 | 20 | 5 | 90.6333 |
| 21 | 21 | 9 | 92.9333 |
| 22 | 22 | 4 | 93.3333 |
| 23 | 23 | 9 | 91.5333 |
| 24 | 24 | 9 | 92.9333 |
| 25 | 25 | 1 | 92.3 |
| 26 | 26 | 9 | 92.2333 |
| 27 | 27 | 6 | 91.9333 |
| 28 | 28 | 5 | 92.1 |
| 29 | 29 | 8 | 84.8 |
+----+----------+------------+------------+
I want to return a dataframe with only the rows whose accuracy is above a threshold (e.g. 92).
The result should look like this:
Table 2: Samples with their groups when accuracy is above 92.
+----+----------+------------+------------+
| | sample | group | Accuracy |
|----+----------+------------+------------|
| 1 | 1 | 4 | 92.9333 |
| 2 | 5 | 5 | 92.5667 |
| 3 | 7 | 5 | 92.3333 |
| 4 | 8 | 2 | 92.7667 |
| 5 | 10 | 4 | 92.5 |
| 6 | 11 | 5 | 92.4 |
| 7 | 12 | 7 | 93.1333 |
| 8 | 13 | 7 | 93.5333 |
| 9 | 14 | 2 | 92.1 |
| 10 | 15 | 6 | 93.2 |
| 11 | 16 | 8 | 92.7333 |
| 12 | 19 | 3 | 93.3 |
| 13 | 21 | 9 | 92.9333 |
| 14 | 22 | 4 | 93.3333 |
| 15 | 24 | 9 | 92.9333 |
| 16 | 25 | 1 | 92.3 |
| 17 | 26 | 9 | 92.2333 |
| 18 | 28 | 5 | 92.1 |
+----+----------+------------+------------+
So the result should contain the rows whose accuracy is greater than or equal to a predefined value (e.g. 92, 90 or 85, etc.).
You can use df_groups.loc[df_groups['Accuracy'] >= predefined_accuracy].
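
As a small runnable sketch of that boolean-mask filter, using only the first six rows of the question's table retyped by hand (the variable name above is an assumption for illustration):

```python
import pandas as pd

# First six rows of the question's table, retyped by hand.
df_groups = pd.DataFrame({
    "sample": [0, 1, 2, 3, 4, 5],
    "group": [6, 4, 2, 2, 4, 5],
    "Accuracy": [91.6, 92.9333, 91.0, 90.0667, 91.8, 92.5667],
})

predefined_accuracy = 92  # threshold; could equally be 90 or 85

# The comparison yields a boolean Series; .loc keeps the rows where it is True.
above = df_groups.loc[df_groups["Accuracy"] >= predefined_accuracy]

# Optional: renumber the index, which the desired output appears to do.
above = above.reset_index(drop=True)
print(above)
```

Without the reset_index call the surviving rows keep their original index labels, which is often what you want when joining back to the source frame.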

Binning Pandas value_counts

I have a Pandas Series produced by df.column.value_counts().sort_index().
| N Months | Count |
|------|------|
| 0 | 15 |
| 1 | 9 |
| 2 | 78 |
| 3 | 151 |
| 4 | 412 |
| 5 | 181 |
| 6 | 543 |
| 7 | 175 |
| 8 | 409 |
| 9 | 594 |
| 10 | 137 |
| 11 | 202 |
| 12 | 170 |
| 13 | 446 |
| 14 | 29 |
| 15 | 39 |
| 16 | 44 |
| 17 | 253 |
| 18 | 17 |
| 19 | 34 |
| 20 | 18 |
| 21 | 37 |
| 22 | 147 |
| 23 | 12 |
| 24 | 31 |
| 25 | 15 |
| 26 | 117 |
| 27 | 8 |
| 28 | 38 |
| 29 | 23 |
| 30 | 198 |
| 31 | 29 |
| 32 | 122 |
| 33 | 50 |
| 34 | 60 |
| 35 | 357 |
| 36 | 329 |
| 37 | 457 |
| 38 | 609 |
| 39 | 4744 |
| 40 | 1120 |
| 41 | 591 |
| 42 | 328 |
| 43 | 148 |
| 44 | 46 |
| 45 | 10 |
| 46 | 1 |
| 47 | 1 |
| 48 | 7 |
| 50 | 2 |
My desired output is
| bin | Total |
|-------|--------|
| 0-13 | 3522 |
| 14-26 | 793 |
| 27-50 | 9278 |
I tried df.column.value_counts(bins=3).sort_index() but got
| bin | Total |
|---------------------------------|-------|
| (-0.051000000000000004, 16.667] | 3634 |
| (16.667, 33.333] | 1149 |
| (33.333, 50.0] | 8810 |
I can get the correct result manually with
a = df.column.value_counts().sort_index()[:14].sum()
b = df.column.value_counts().sort_index()[14:27].sum()
c = df.column.value_counts().sort_index()[27:].sum()
print(a, b, c)
Output: 3522 793 9278
But I am wondering if there is a pandas method that can do what I want. Any advice is very welcome. :-)
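You can use pd.cut (assuming df['N Months'] holds the raw values rather than the counts):
pd.cut(df['N Months'], [0, 13, 26, 50], include_lowest=True).value_counts()
Update: you should be able to pass custom bins directly to value_counts; sort_index keeps the bins in order instead of sorting by count:
df['N Months'].value_counts(bins=[0, 13, 26, 50]).sort_index()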
Output:
N Months
(-0.001, 13.0] 3522
(13.0, 26.0] 793
(26.0, 50.0] 9278
Name: Count, dtype: int64
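
If only the already aggregated Series is available (index = number of months, values = counts), one possible approach is to bin the index and sum the counts per bin. This is a sketch under that assumption; the counts Series below is a hypothetical, much shortened stand-in for the question's data:

```python
import pandas as pd

# Hypothetical stand-in for df.column.value_counts().sort_index():
# index = number of months, values = how often that number occurred.
counts = pd.Series([15, 9, 446, 29, 117, 8, 2], index=[0, 1, 13, 14, 26, 27, 50])

# Assign each index value to a bin; include_lowest keeps 0 in the first bin.
bins = pd.cut(counts.index.to_numpy(), bins=[0, 13, 26, 50], include_lowest=True)

# Sum the counts that fall into each bin (observed=False keeps empty bins too).
totals = counts.groupby(bins, observed=False).sum()
print(totals)
```

The bin edges are right-inclusive by default, so 13 lands in the first bin and 26 in the second, matching the 0-13, 14-26, 27-50 split for integer month values.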

Find top N values within each group

I have a dataset similar to the sample below:
| id | size | old_a | old_b | new_a | new_b |
|----|--------|-------|-------|-------|-------|
| 6 | small | 3 | 0 | 21 | 0 |
| 6 | small | 9 | 0 | 23 | 0 |
| 13 | medium | 3 | 0 | 12 | 0 |
| 13 | medium | 37 | 0 | 20 | 1 |
| 20 | medium | 30 | 0 | 5 | 6 |
| 20 | medium | 12 | 2 | 3 | 0 |
| 12 | small | 7 | 0 | 2 | 0 |
| 10 | small | 8 | 0 | 12 | 0 |
| 15 | small | 19 | 0 | 3 | 0 |
| 15 | small | 54 | 0 | 8 | 0 |
| 87 | medium | 6 | 0 | 9 | 0 |
| 90 | medium | 11 | 1 | 16 | 0 |
| 90 | medium | 25 | 0 | 4 | 0 |
| 90 | medium | 10 | 0 | 5 | 0 |
| 9 | large | 8 | 1 | 23 | 0 |
| 9 | large | 19 | 0 | 2 | 0 |
| 1 | large | 1 | 0 | 0 | 0 |
| 50 | large | 34 | 0 | 7 | 0 |
This is the input for above table:
data=[[6,'small',3,0,21,0],[6,'small',9,0,23,0],[13,'medium',3,0,12,0],[13,'medium',37,0,20,1],[20,'medium',30,0,5,6],[20,'medium',12,2,3,0],[12,'small',7,0,2,0],[10,'small',8,0,12,0],[15,'small',19,0,3,0],[15,'small',54,0,8,0],[87,'medium',6,0,9,0],[90,'medium',11,1,16,0],[90,'medium',25,0,4,0],[90,'medium',10,0,5,0],[9,'large',8,1,23,0],[9,'large',19,0,2,0],[1,'large',1,0,0,0],[50,'large',34,0,7,0]]
data= pd.DataFrame(data,columns=['id','size','old_a','old_b','new_a','new_b'])
I want to have an output which will group the dataset on size and would list out top 2 id based on the values of 'new_a' column within each group of size. Since, some of the ids are repeating multiple times, I would want to sum the values of new_a for such ids and then find top 2 values. My final table should look like the one below:
| size | id | new_a |
|--------|----|-------|
| large | 9 | 25 |
| large | 50 | 7 |
| medium | 13 | 32 |
| medium | 90 | 25 |
| small | 6 | 44 |
| small | 10 | 12 |
I have tried the below code but it isn't showing top 2 values of new_a for each group within 'size' column.
nlargest = data.groupby(['size','id'])['new_a'].sum().nlargest(2).reset_index()
print(
    df.groupby('size').apply(
        lambda x: x.groupby('id').sum().nlargest(2, columns='new_a')
    ).reset_index()[['size', 'id', 'new_a']]
)
Prints:
size id new_a
0 large 9 25
1 large 50 7
2 medium 13 32
3 medium 90 25
4 small 6 44
5 small 10 12
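You can set size and id as the index and work on the index levels instead of grouping on columns twice. Older pandas allowed Series.sum with the level parameter here; that parameter was removed in pandas 2.0, so the inner sum is written as a groupby on the level.
data.set_index(["size", "id"])["new_a"].groupby(level=0).apply(
    lambda x: x.groupby(level=1).sum().nlargest(2)
).reset_index()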
size id new_a
0 large 9 25
1 large 50 7
2 medium 13 32
3 medium 90 25
4 small 6 44
5 small 10 12
You can chain two groupby methods:
data.groupby(['id', 'size'])['new_a'].sum().groupby('size').nlargest(2)\
.droplevel(0).to_frame('new_a').reset_index()
Output:
id size new_a
0 9 large 25
1 50 large 7
2 13 medium 32
3 90 medium 25
4 6 small 44
5 10 small 12
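
Another way to express the same idea, offered as a sketch rather than a replacement for the answers above: sum new_a per (size, id), sort descending within each size, and keep the first two rows of each size. The frame below is a hand-picked subset of the question's data with only three columns, and the names totals and top2 are illustrative:

```python
import pandas as pd

# Hand-picked subset of the question's data (small and large rows only).
data = pd.DataFrame(
    [[6, "small", 21], [6, "small", 23], [12, "small", 2], [10, "small", 12],
     [9, "large", 23], [9, "large", 2], [1, "large", 0], [50, "large", 7]],
    columns=["id", "size", "new_a"],
)

# Step 1: one row per (size, id) holding the summed new_a.
totals = data.groupby(["size", "id"], as_index=False)["new_a"].sum()

# Step 2: order by size, then by new_a descending, and keep two rows per size.
top2 = (
    totals.sort_values(["size", "new_a"], ascending=[True, False])
    .groupby("size")
    .head(2)
    .reset_index(drop=True)
)
print(top2)
```

Because head(2) simply takes the first rows of each group in their current order, the sort step decides which ids count as the top two; ties are broken by that sort order.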

Filter all rows from groupby object

I have a dataframe like below
+-----------+------------+---------------+------+-----+-------+
| InvoiceNo | CategoryNo | Invoice Value | Item | Qty | Price |
+-----------+------------+---------------+------+-----+-------+
| 1 | 1 | 77 | 128 | 1 | 10 |
| 1 | 1 | 77 | 101 | 1 | 11 |
| 1 | 2 | 77 | 105 | 3 | 12 |
| 1 | 3 | 77 | 129 | 2 | 10 |
| 2 | 1 | 21 | 145 | 1 | 9 |
| 2 | 2 | 21 | 130 | 1 | 12 |
+-----------+------------+---------------+------+-----+-------+
I want to keep an entire group if any of the items in the list item_list = [128,129,130] is present in that group, after grouping by 'InvoiceNo' and 'CategoryNo'.
My desired output is as below
+-----------+------------+---------------+------+-----+-------+
| InvoiceNo | CategoryNo | Invoice Value | Item | Qty | Price |
+-----------+------------+---------------+------+-----+-------+
| 1 | 1 | 77 | 128 | 1 | 10 |
| 1 | 1 | 77 | 101 | 1 | 11 |
| 1 | 3 | 77 | 129 | 2 | 10 |
| 2 | 2 | 21 | 130 | 1 | 12 |
+-----------+------------+---------------+------+-----+-------+
I know how to filter a dataframe using isin(), but I am not sure how to do it with groupby().
So far I have tried the following:
import pandas as pd
df = pd.read_csv('data.csv')
item_list = [128,129,130]
df.groupby(['InvoiceNo','CategoryNo'])['Item'].isin(item_list)
but nothing happens. Please guide me on how to solve this issue.
You can do something like this:
s = (df['Item'].isin(item_list)
.groupby([df['InvoiceNo'], df['CategoryNo']])
.transform('any')
)
df[s]
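
To make the steps concrete, here is a self-contained sketch that rebuilds the question's table by hand and applies the same transform('any') idea (the names is_listed, keep and filtered are illustrative assumptions):

```python
import pandas as pd

# The question's table, retyped by hand.
df = pd.DataFrame({
    "InvoiceNo": [1, 1, 1, 1, 2, 2],
    "CategoryNo": [1, 1, 2, 3, 1, 2],
    "Invoice Value": [77, 77, 77, 77, 21, 21],
    "Item": [128, 101, 105, 129, 145, 130],
    "Qty": [1, 1, 3, 2, 1, 1],
    "Price": [10, 11, 12, 10, 9, 12],
})
item_list = [128, 129, 130]

# Row-level flag: is this row's Item in the list?
is_listed = df["Item"].isin(item_list)

# Broadcast "at least one row in my group is flagged" back to every row of the group.
keep = is_listed.groupby([df["InvoiceNo"], df["CategoryNo"]]).transform("any")

filtered = df[keep]
print(filtered)
```

Item 101 survives because it shares the (1, 1) group with item 128. df.groupby([...]).filter(lambda g: g['Item'].isin(item_list).any()) should give the same rows, though calling a Python function per group is typically slower when there are many groups.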

Parsing out indices and values from pandas multi index dataframe

I have a dataframe in a similar format to this:
+--------+--------+----------+------+------+------+------+
| | | | | day1 | day2 | day3 |
+--------+--------+----------+------+------+------+------+
| id_one | id_two | id_three | date | | | |
| 18273 | 50 | 1 | 3 | 9 | 11 | 3 |
| | | | 4 | 26 | 27 | 68 |
| | | | 5 | 92 | 25 | 4 |
| | | | 6 | 60 | 72 | 83 |
| | 60 | 2 | 5 | 69 | 93 | 84 |
| | | | 6 | 69 | 30 | 12 |
| | | | 7 | 65 | 65 | 59 |
| | | | 8 | 57 | 88 | 59 |
| | 70 | 3 | 5 | 22 | 95 | 7 |
| | | | 6 | 40 | 24 | 20 |
| | | | 7 | 73 | 81 | 57 |
| | | | 8 | 43 | 8 | 66 |
+--------+--------+----------+------+------+------+------+
I am trying to create a tuple that contains id_one, id_two and the values that each grouping contains.
To test this, I am simply trying to print the ids and values like this:
for id_two, data in df.head(100).groupby(level='id_two'):
    print id_two, data.values.ravel()
Which gives me the id_two and the data exactly as it should.
I am running into problems when I try to incorporate id_one. I tried this, but was met with the error ValueError: need more than 2 values to unpack
for id_one, id_two, data in df.head(100).groupby(level='id_two'):
    print id_one, id_two, data.values.ravel()
How can I print id_one, id_two and the data?
You can pass a list of level names to the level parameter; each group key is then an (id_one, id_two) tuple:
df.head(100).groupby(level=['id_one', 'id_two'])
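
A minimal sketch of the full loop, written for Python 3 and using a hypothetical four-row miniature of the question's frame (the collected list is an illustrative addition). Grouping by two levels yields (key, group) pairs in which the key is itself a 2-tuple, so the loop target needs nested parentheses:

```python
import pandas as pd

# Hypothetical miniature of the question's frame: 4-level MultiIndex, three day columns.
index = pd.MultiIndex.from_tuples(
    [(18273, 50, 1, 3), (18273, 50, 1, 4), (18273, 60, 2, 5), (18273, 60, 2, 6)],
    names=["id_one", "id_two", "id_three", "date"],
)
df = pd.DataFrame(
    [[9, 11, 3], [26, 27, 68], [69, 93, 84], [69, 30, 12]],
    index=index,
    columns=["day1", "day2", "day3"],
)

collected = []
# Each iteration yields ((id_one, id_two), sub_frame); unpack the key tuple in place.
for (id_one, id_two), data in df.groupby(level=["id_one", "id_two"]):
    collected.append((id_one, id_two, data.values.ravel().tolist()))
    print(id_one, id_two, data.values.ravel())
```

Writing the loop as for id_one, id_two, data in ... fails because each item has only two elements, the key tuple and the group, which is what the unpacking error in the question reports.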
