How to get the count inside a groupby function in Python

I have pandas data frame of the format:
line_idno item_name \
sitename ts_placed order_idno
www.mattressesworld.co.uk 47 5242367 4112061 a
www.bedroomfurnitureworld.co.uk 47 5242295 4111977 b
5242295 4111979 a
5242295 4111978 v
5242295 4111980 a
www.mattressesworld.co.uk 47 5242300 4111986 b
www.bedroomfurnitureworld.co.uk 47 5242294 4111973 v
This frame has three index levels ('sitename', 'ts_placed', 'order_idno'), where:
'ts_placed' is the week of the year
'sitename' is the name of the site
'order_idno' is the order ID number
and five columns ('line_idno', 'item_name', 'item_qty', 'item_price', 'revenue').
The data was grouped on multiple levels with:
grouped = data.groupby(level=['sitename','ts_placed','order_idno']).sum()
grouped0 = grouped.dropna()
result:
line_idno item_qty \
sitename ts_placed order_idno
www.bedroomfurnitureworld.co.uk 38 5156953 3994322 1
5156956 3994325 1
5157144 3994580 1
5157191 3994641 0
5157198 3994651 1
5157217 3994678 2
5157218 3994679 2
5157233 3994697 1
5157257 7989463 2
What I want to obtain is the weekly breakdown of the mean revenue for each site. In other words, the sum of all revenues for each ts_placed/sitename group divided by the row count for each ts_placed.

Here is a reproducible example.
import pandas as pd
import numpy as np
# simulate some artificial data
np.random.seed(1)
sites = ['www.{}.co.uk'.format(x) for x in 'AAA BBB CCC DDD EEE'.split()]
sitename = np.random.choice(sites, size=1000)
ts_placed = np.random.choice(np.arange(47, 53), size=1000)
order_idno = np.random.choice(np.arange(520000, 550000), size=1000)
item_name = np.random.choice('a b c d e f'.split(), size=1000)
line_idno = np.random.choice(np.arange(3960000, 4000000), size=1000)
item_qty = np.random.choice(np.arange(0, 10), size=1000)
item_price = np.random.choice(np.arange(1000, 10000), size=1000)
revenue = item_price * item_qty
data = pd.DataFrame(dict(sitename=sitename, ts_placed=ts_placed, order_idno=order_idno, item_name=item_name, line_idno=line_idno, item_qty=item_qty, item_price=item_price, revenue=revenue)).set_index(['sitename', 'ts_placed', 'order_idno'])
Out[17]:
item_name item_price item_qty line_idno revenue
sitename ts_placed order_idno
www.DDD.co.uk 47 526418 a 4514 1 3989144 4514
www.EEE.co.uk 52 539155 d 4952 5 3965922 24760
www.AAA.co.uk 52 539417 d 8816 0 3988185 0
www.BBB.co.uk 49 523800 b 3340 3 3981971 10020
www.DDD.co.uk 48 521464 f 4402 6 3976820 26412
www.AAA.co.uk 49 521706 c 8436 5 3963275 42180
52 544452 c 7220 8 3992357 57760
www.BBB.co.uk 50 548184 d 3389 9 3976608 30501
www.EEE.co.uk 49 527830 f 8110 1 3998527 8110
521908 a 7292 4 3964393 29168
www.BBB.co.uk 47 527558 b 4945 6 3977830 29670
www.CCC.co.uk 47 549572 f 3350 5 3988678 16750
www.EEE.co.uk 48 522511 f 1865 0 3992356 0
www.CCC.co.uk 51 520156 e 4717 8 3974344 37736
www.EEE.co.uk 50 534951 b 3738 9 3978519 33642
... ... ... ... ... ...
www.CCC.co.uk 50 525279 e 5961 0 3980873 0
www.DDD.co.uk 48 539486 c 2028 4 3978442 8112
www.EEE.co.uk 48 543216 e 3721 6 3986919 22326
www.BBB.co.uk 51 525662 c 1264 7 3987129 8848
www.CCC.co.uk 52 546208 e 7287 4 3999828 29148
www.AAA.co.uk 48 544288 a 7708 1 3974546 7708
www.DDD.co.uk 52 538708 f 9080 7 3983499 63560
www.CCC.co.uk 48 536774 a 8971 2 3968092 17942
www.BBB.co.uk 48 528310 c 3284 2 3985896 6568
www.AAA.co.uk 49 549547 c 4265 4 3960981 17060
544394 c 2268 8 3982739 18144
52 540515 f 4476 5 3987786 22380
www.EEE.co.uk 50 540388 f 1226 5 3980156 6130
47 522633 f 4185 5 3964986 20925
www.CCC.co.uk 49 532710 c 7462 2 3984676 14924
[1000 rows x 5 columns]
# your custom apply function
def apply_func(group):
    avg_revenue = group.revenue.mean()
    count_unique_order = len(group.order_idno.unique())
    # or try this
    # count_unique_order = group.order_idno.value_counts().count()
    return pd.Series({'avg_revenue': avg_revenue, 'count_unique_order': count_unique_order})
# use the customized apply function
data.reset_index(level='order_idno').dropna().groupby(level=['sitename', 'ts_placed']).apply(apply_func)
Out[46]:
avg_revenue count_unique_order
sitename ts_placed
www.AAA.co.uk 47 23501.8158 10
48 23003.9355 10
49 24254.1212 10
50 23254.6410 10
51 19173.8966 10
52 23845.6786 10
www.BBB.co.uk 47 26136.0882 10
48 23007.3929 9
49 30605.2857 10
50 19530.3871 10
51 21768.6667 9
52 28387.5455 10
www.CCC.co.uk 47 28917.3448 9
48 23659.3488 10
49 26209.0625 9
50 22929.2564 10
51 23474.2857 9
52 22123.3429 10
www.DDD.co.uk 47 27176.2778 10
48 24530.6154 10
49 23601.8710 9
50 27749.2162 10
51 26816.0000 9
52 29910.5455 10
www.EEE.co.uk 47 27270.6471 10
48 23498.0789 10
49 26682.4250 10
50 24524.4400 10
51 15635.2500 10
52 20917.2500 10
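The same kind of weekly breakdown can probably also be written without a custom function, using named aggregation in groupby(...).agg. This is only a sketch: the tiny frame below is made-up illustration data (not the asker's), and only the column names are taken from the question.

```python
import pandas as pd

# Made-up rows for illustration only; column names follow the question.
data = pd.DataFrame({
    "sitename": ["www.AAA.co.uk"] * 3 + ["www.BBB.co.uk"] * 2,
    "ts_placed": [47, 47, 48, 47, 47],
    "order_idno": [1, 1, 2, 3, 4],
    "revenue": [100, 300, 50, 10, 30],
})

# One row per (site, week): mean revenue per line and number of distinct orders.
weekly = data.groupby(["sitename", "ts_placed"]).agg(
    avg_revenue=("revenue", "mean"),
    count_unique_order=("order_idno", "nunique"),
)
print(weekly)
```

If the keys live in the index, as in the question, calling reset_index() first (or grouping with level=[...]) should give the same grouping.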

Related

Replace values more than 80 percentile with 80 percentile in pandas

I have a df as shown below
df:
Id gender age salary
1 m 27 100
2 m 26 100000
3 m 57 180
4 f 27 150
5 m 57 200
6 f 29 100
7 m 47 130
8 f 27 140
9 m 37 100
10 f 43 2000
From the above, I would like to replace every salary above the 80th percentile with the 80th-percentile value.
Expected output:
Id gender age salary
1 m 27 100
2 m 26 560
3 m 57 180
4 f 27 150
5 m 57 200
6 f 29 100
7 m 47 130
8 f 27 140
9 m 37 100
10 f 43 560
Let's try:
quantiles = df.salary.quantile(0.8)
df.loc[df.salary > quantiles, 'salary'] = quantiles
Output (can't quite get 200 as .8 percentile though):
Id gender age salary
0 1 m 27 100.0
1 2 m 26 560.0
2 3 m 57 180.0
3 4 f 27 150.0
4 5 m 57 200.0
5 6 f 29 100.0
6 7 m 47 130.0
7 8 f 27 140.0
8 9 m 37 100.0
9 10 f 43 560.0
In case you want to fill within gender:
quantiles = df.groupby('gender')['salary'].transform('quantile', q=0.8)
df.loc[df.salary > quantiles, 'salary'] = quantiles
Output:
Id gender age salary
0 1 m 27 100
1 2 m 26 200
2 3 m 57 180
3 4 f 27 150
4 5 m 57 200
5 6 f 29 100
6 7 m 47 130
7 8 f 27 140
8 9 m 37 100
9 10 f 43 890
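As an alternative sketch, Series.clip can do the capping without boolean indexing. The frame is rebuilt from the question's table; the names overall_cap, group_cap, capped_overall and capped_by_gender are just illustrative.

```python
import pandas as pd

# Data copied from the question's table.
df = pd.DataFrame({
    "Id": range(1, 11),
    "gender": list("mmmfmfmfmf"),
    "age": [27, 26, 57, 27, 57, 29, 47, 27, 37, 43],
    "salary": [100, 100000, 180, 150, 200, 100, 130, 140, 100, 2000],
})

# Cap at the overall 80th percentile (linear interpolation is the default).
overall_cap = df["salary"].quantile(0.8)
capped_overall = df["salary"].clip(upper=overall_cap)

# Cap at each gender's own 80th percentile; the threshold Series is aligned on the index.
group_cap = df.groupby("gender")["salary"].transform("quantile", q=0.8)
capped_by_gender = df["salary"].clip(upper=group_cap)

print(capped_overall.tolist())
print(capped_by_gender.tolist())
```

Unlike the .loc assignment, clip returns a new Series, so the original salary column stays untouched until you assign the result back.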

Assign a value of 1 when another variable was equal 1 at the last time

I have this data frame:
ID Date X1 X2 Y
A 16-07-19 58 50 0
A 17-07-19 61 83 1
A 18-07-19 97 38 0
A 19-07-19 29 77 0
A 20-07-19 66 71 1
A 21-07-19 28 74 0
B 19-07-19 54 65 1
B 20-07-19 55 32 1
B 21-07-19 50 30 0
B 22-07-19 51 38 0
B 23-07-19 81 61 0
C 24-07-19 55 29 0
C 25-07-19 97 69 1
C 26-07-19 92 44 1
C 27-07-19 55 97 0
C 28-07-19 13 48 1
D 29-07-19 77 27 1
D 30-07-19 68 50 1
D 31-07-19 71 32 1
D 01-08-19 89 57 1
D 02-08-19 46 70 0
D 03-08-19 14 68 1
D 04-08-19 12 87 1
D 05-08-19 56 13 0
E 06-08-19 47 35 1
I want to create a variable that equals 1 on the row where Y was last equal to 1 (for each ID), and 0 otherwise.
I also want to exclude all the rows that come after the last time Y was equal to 1.
Expected result:
ID Date X1 X2 Y Last
A 16-07-19 58 50 0 0
A 17-07-19 61 83 1 0
A 18-07-19 97 38 0 0
A 19-07-19 29 77 0 0
A 20-07-19 66 71 1 1
B 19-07-19 54 65 1 0
B 20-07-19 55 32 1 1
C 24-07-19 55 29 0 0
C 25-07-19 97 69 1 0
C 26-07-19 92 44 1 0
C 27-07-19 55 97 0 0
C 28-07-19 13 48 1 1
D 29-07-19 77 27 1 0
D 30-07-19 68 50 1 0
D 31-07-19 71 32 1 0
D 01-08-19 89 57 1 0
D 02-08-19 46 70 0 0
D 03-08-19 14 68 1 0
D 04-08-19 12 87 1 1
E 06-08-19 47 35 1 1
First remove all rows after the last 1 in Y: compare Y with 1, reverse the row order, and take GroupBy.cumsum per ID; rows where the cumulative sum is not 0 are kept by boolean indexing. Then use
numpy.where to build the new column:
df = df[df['Y'].eq(1).iloc[::-1].groupby(df['ID']).cumsum().ne(0).sort_index()]
df['Last'] = np.where(df['ID'].duplicated(keep='last'), 0, 1)
print (df)
ID Date X1 X2 Y Last
0 A 16-07-19 58 50 0 0
1 A 17-07-19 61 83 1 0
2 A 18-07-19 97 38 0 0
3 A 19-07-19 29 77 0 0
4 A 20-07-19 66 71 1 1
6 B 19-07-19 54 65 1 0
7 B 20-07-19 55 32 1 1
11 C 24-07-19 55 29 0 0
12 C 25-07-19 97 69 1 0
13 C 26-07-19 92 44 1 0
14 C 27-07-19 55 97 0 0
15 C 28-07-19 13 48 1 1
16 D 29-07-19 77 27 1 0
17 D 30-07-19 68 50 1 0
18 D 31-07-19 71 32 1 0
19 D 01-08-19 89 57 1 0
20 D 02-08-19 46 70 0 0
21 D 03-08-19 14 68 1 0
22 D 04-08-19 12 87 1 1
24 E 06-08-19 47 35 1 1
EDIT: To keep all rows and only flag the last 1 per ID:
m = df['Y'].eq(1).iloc[::-1].groupby(df['ID']).cumsum().ne(0).sort_index()
df['Last'] = np.where(m.ne(m.groupby(df['ID']).shift(-1)) & m,1,0)
print (df)
ID Date X1 X2 Y Last
0 A 16-07-19 58 50 0 0
1 A 17-07-19 61 83 1 0
2 A 18-07-19 97 38 0 0
3 A 19-07-19 29 77 0 0
4 A 20-07-19 66 71 1 1
5 A 21-07-19 28 74 0 0
6 B 19-07-19 54 65 1 0
7 B 20-07-19 55 32 1 1
8 B 21-07-19 50 30 0 0
9 B 22-07-19 51 38 0 0
10 B 23-07-19 81 61 0 0
11 C 24-07-19 55 29 0 0
12 C 25-07-19 97 69 1 0
13 C 26-07-19 92 44 1 0
14 C 27-07-19 55 97 0 0
15 C 28-07-19 13 48 1 1
16 D 29-07-19 77 27 1 0
17 D 30-07-19 68 50 1 0
18 D 31-07-19 71 32 1 0
19 D 01-08-19 89 57 1 0
20 D 02-08-19 46 70 0 0
21 D 03-08-19 14 68 1 0
22 D 04-08-19 12 87 1 1
23 D 05-08-19 56 13 0 0
24 E 06-08-19 47 35 1 1
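To see why the reversed cumsum works, here is a step-by-step sketch on a made-up two-ID frame. It is a variant, not the exact code above: the Last flag is derived from the reversed count itself instead of duplicated or shift, and the intermediate names are illustrative.

```python
import pandas as pd

# Two made-up IDs; Y marks the event of interest.
df = pd.DataFrame({
    "ID": list("AAAABB"),
    "Y": [0, 1, 0, 0, 1, 1],
})

is_one = df["Y"].eq(1)
# Walking each ID bottom-up, count the 1s met so far (the row itself included).
ones_from_bottom = is_one.iloc[::-1].groupby(df["ID"]).cumsum().sort_index()
# A row is kept while at least one 1 lies at or below it within its ID.
keep = ones_from_bottom.ne(0)
# The last 1 of an ID is the row where Y is 1 and that count is exactly 1.
last = (is_one & ones_from_bottom.eq(1)).astype(int)

result = df[keep].assign(Last=last[keep])
print(result)
```

Using assign on the filtered frame also avoids writing a new column into a slice of the original df.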

Incremental assignment in pandas dataframe to determine month from week number without date element

I have week numbers in the dataframe from 1 to 52, e.g. [1,2,3,4,5,6,7,8,..52]
I'm trying to create a new column for the month, which means an incremental assignment like [1,2,3,4] = 1, [5,6,7,8] = 2, .. [49,50,51,52] = 12
I tried getting the records at multiples of 4 using df[df["week"]%4==0] and then ffill, but it seems we can only assign the same number to all of them, which is not what I want. Instead I want to assign [1..12] accordingly. Is there another way to do this?
Subtract 1 first and then use integer division by 4:
df = pd.DataFrame({'week':range(1,53)})
df['new'] = (df["week"] - 1)//4
print (df.head(10))
week new
0 1 0
1 2 0
2 3 0
3 4 0
4 5 1
5 6 1
6 7 1
7 8 1
8 9 2
9 10 2
print (df.tail(10))
week new
42 43 10
43 44 10
44 45 11
45 46 11
46 47 11
47 48 11
48 49 12
49 50 12
50 51 12
51 52 12
If you want the values to start at 1, that is possible, but the last value is then 13:
df['new'] = ((df["week"] - 1)//4) + 1
print (df.head(10))
week new
0 1 1
1 2 1
2 3 1
3 4 1
4 5 2
5 6 2
6 7 2
7 8 2
8 9 3
9 10 3
print (df.tail(10))
week new
42 43 11
43 44 11
44 45 12
45 46 12
46 47 12
47 48 12
48 49 13
49 50 13
50 51 13
51 52 13
If you want values between 1 and 12 (some groups then have more than 4 values), use this solution by @Aryerez, thank you:
df['new'] = ((df["week"] - 1) // (52 / 12)).astype(int) + 1
print (df.head(10))
week new
0 1 1
1 2 1
2 3 1
3 4 1
4 5 1
5 6 2
6 7 2
7 8 2
8 9 2
9 10 3
print (df.tail(10))
week new
42 43 10
43 44 10
44 45 11
45 46 11
46 47 11
47 48 11
48 49 12
49 50 12
50 51 12
51 52 12
EDIT: For 5 values in every third group, use:
df['new'] = ((df["week"] + 4) // (52 / 12)).astype(int)
print (df.head(15))
week new
0 1 1
1 2 1
2 3 1
3 4 1
4 5 2
5 6 2
6 7 2
7 8 2
8 9 3
9 10 3
10 11 3
11 12 3
12 13 3
13 14 4
14 15 4
print (df.tail(15))
week new
37 38 9
38 39 9
39 40 10
40 41 10
41 42 10
42 43 10
43 44 11
44 45 11
45 46 11
46 47 11
47 48 12
48 49 12
49 50 12
50 51 12
51 52 12
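A quick way to compare the variants is to count how many weeks land in each label. The sketch below only re-runs two of the formulas above and inspects the group sizes; the names four_week and twelfths are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"week": range(1, 53)})

# Fixed blocks of four weeks: labels run from 1 to 13.
four_week = (df["week"] - 1) // 4 + 1
# Blocks of 52/12 weeks: labels run from 1 to 12, block sizes are 4 or 5.
twelfths = ((df["week"] - 1) // (52 / 12)).astype(int) + 1

# Number of weeks assigned to each label.
print(four_week.value_counts().sort_index())
print(twelfths.value_counts().sort_index())
```

Since 52 is not divisible by 12 into equal whole blocks, any mapping onto exactly 12 labels has to mix 4-week and 5-week groups; the formulas only differ in where the longer groups fall.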

How to randomly drop rows in Pandas dataframe until there are equal number of values in a column?

I have a dataframe df with two columns, X and y.
In df['y'] I have integers from 1 to 10 inclusive. However, they have different frequencies:
df['y'].value_counts()
10 6645
9 6213
8 5789
7 4643
6 2532
5 1839
4 1596
3 878
2 815
1 642
I want to cut down my dataframe so that each label occurs equally often. The minimum frequency is 642, so I want to keep only 642 randomly sampled rows of each class label, giving a new dataframe with 642 rows per label.
I thought stratified sampling might help; however, stratifying only keeps the same percentage of each label, whereas I want all my labels to have the same frequency.
As an example of a dataframe:
import random
import pandas as pd
df = pd.DataFrame()
df['y'] = sum([[10]*6645, [9]* 6213,[8]* 5789, [7]*4643,[6]* 2532, [5]*1839,[4]* 1596,[3]* 878, [2]*815, [1]* 642],[])
df['X'] = [random.choice(list('abcdef')) for i in range(len(df))]
Use DataFrame.sample with groupby:
df = pd.DataFrame(np.random.randint(1, 11, 100), columns=['y'])
val_cnt = df['y'].value_counts()
min_sample = val_cnt.min()
print(min_sample) # Outputs 7 as an example
print(df.groupby('y').apply(lambda s: s.sample(min_sample)))
Output
y
y
1 68 1
8 1
82 1
17 1
99 1
31 1
6 1
2 55 2
15 2
81 2
22 2
46 2
13 2
58 2
3 2 3
30 3
84 3
61 3
78 3
24 3
98 3
4 51 4
86 4
52 4
10 4
42 4
80 4
53 4
5 16 5
87 5
... ..
6 26 6
18 6
7 56 7
4 7
60 7
65 7
85 7
37 7
70 7
8 93 8
41 8
28 8
20 8
33 8
64 8
62 8
9 73 9
79 9
9 9
40 9
29 9
57 9
7 9
10 96 10
67 10
47 10
54 10
97 10
71 10
94 10
[70 rows x 1 columns]
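On pandas 1.1 or newer, the groupby object has its own sample method, which should make the lambda unnecessary. A minimal sketch on made-up imbalanced labels (the class sizes 5, 20 and 50 and the seed are arbitrary assumptions):

```python
import numpy as np
import pandas as pd

# Imbalanced labels, made up for illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "y": np.repeat([1, 2, 3], [5, 20, 50]),
    "X": rng.choice(list("abcdef"), size=75),
})

# Size of the rarest class.
n_min = df["y"].value_counts().min()
# Draw n_min rows from every class without replacement; random_state makes it repeatable.
balanced = df.groupby("y").sample(n=n_min, random_state=0)

print(balanced["y"].value_counts())
```

The result keeps the original row index rather than adding a group level, which can be convenient if the rows need to be matched back to the full frame.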

Pandas dataframe from nested dictionary to melted data frame

I converted a nested dictionary to a Pandas DataFrame, which I want to use to create a heatmap.
The DataFrame is simple to create from the nested dictionary:
>>>df = pandas.DataFrame.from_dict(my_nested_dict)
>>>df
93 94 95 96 97 98 99 100 100A 100B ... 100M 100N 100O 100P 100Q 100R 100S 101 102 103
A 465 5 36 36 28 24 25 30 28 32 ... 28 19 16 15 4 4 185 2 7 3
C 0 1 2 0 6 10 8 16 23 17 ... 9 5 6 3 4 2 3 3 0 1
D 1 0 132 6 17 22 17 25 21 25 ... 12 16 21 7 5 18 2 1 296 0
E 4 0 45 10 16 12 10 15 17 18 ... 4 9 7 10 5 6 4 3 129 0
F 1 0 4 17 14 11 8 11 24 9 ... 17 8 8 12 7 3 1 98 0 1
G 2 10 77 55 71 52 65 39 37 45 ... 46 65 23 9 18 171 141 2 31 0
H 0 5 25 12 18 8 12 7 10 6 ... 8 11 6 4 4 5 2 2 1 8
I 1 8 7 23 26 35 36 34 31 38 ... 19 7 2 37 7 3 0 3 2 26
K 0 42 3 24 5 15 17 11 6 8 ... 9 10 9 8 9 2 1 28 0 0
L 3 0 19 50 32 33 21 26 26 18 ... 19 44 122 11 10 7 5 17 2 5
M 0 1 1 3 1 13 9 12 12 8 ... 20 3 1 1 0 1 0 191 0 0
N 0 5 3 12 8 15 12 13 21 9 ... 18 10 10 11 12 26 3 0 5 1
P 1 1 19 50 39 47 42 43 39 33 ... 48 35 15 16 59 2 13 6 0 160
Q 0 2 16 15 12 13 10 13 16 5 ... 11 6 3 11 4 1 0 1 6 28
R 0 380 17 66 54 41 51 32 24 29 ... 43 44 16 17 14 6 2 126 4 5
S 14 18 27 42 55 37 41 42 45 70 ... 47 31 64 14 42 18 8 3 1 5
T 4 13 17 32 29 37 33 32 30 38 ... 87 79 19 125 96 11 11 7 7 3
V 4 9 36 24 39 40 35 45 42 52 ... 20 12 12 9 8 5 0 6 7 209
W 0 0 1 6 6 8 4 7 7 9 ... 6 6 1 1 1 1 27 1 0 0
X 0 0 0 0 0 0 0 0 0 0 ... 0 4 0 0 0 0 0 0 0 0
Y 0 0 13 17 24 27 44 47 41 31 ... 29 76 139 179 191 208 92 0 2 45
I like to use ggplot to make heat maps which would just be this data frame. However, the dataframes needed for ggplot are a little different. I can use the pandas.melt function to get close, but I'm missing the row titles.
>>>mdf = pandas.melt(df)
>>>mdf
variable value
0 93 465
1 93 0
2 93 1
3 93 4
4 93 1
5 93 2
6 93 0
7 93 1
8 93 0
...
624 103 5
625 103 3
626 103 209
627 103 0
628 103 0
629 103 45
The easiest way to get there would be to add the amino acid (the row label) as a column, so that the DataFrame looks like:
variable value rowvalue
0 93 465 A
1 93 0 C
2 93 1 D
3 93 4 E
4 93 1 F
5 93 2 G
6 93 0 H
7 93 1 I
8 93 0 K
That way I can take that dataframe and put it right into ggplot:
>>> from ggplot import *
>>> ggplot(new_df,aes("variable","rowvalue")) + geom_tile(fill="value")
would produce a beautiful heatmap. How do I manipulate the nested-dictionary dataframe to get that final dataframe? If there is a more efficient way to do this, I'm open to suggestions, but I still want to use ggplot2.
Edit -
I found a solution but it seems to be way too convoluted. Basically I make the index into a column, then melt the data frame.
>>>df.reset_index(level=0,inplace=True)
>>>pandas.melt(df,id_vars=['index'])
index variable value
0 A 93 465
1 C 93 0
2 D 93 1
3 E 93 4
4 F 93 1
5 G 93 2
6 H 93 0
7 I 93 1
8 K 93 0
9 L 93 3
10 M 93 0
11 N 93 0
12 P 93 1
13 Q 93 0
14 R 93 0
15 S 93 14
16 T 93 4
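If I understand your question properly, I think you can simply do the following:
import numpy
mdf = pandas.melt(df)
# melt stacks the columns one after another, so repeat the row labels once per column
mdf['rowvalue'] = numpy.tile(df.index, len(df.columns))
mdf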
variable value rowvalue
0 93 465 A
1 93 0 C
2 93 1 D
3 93 4 E
4 93 1 F
5 93 2 G
6 93 0 H
7 93 1 I
8 93 0 K
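Another way to keep the row labels is to move the index into a named column before melting, all in one chain. A minimal sketch on a made-up two-column nested dict (the values are borrowed from the first rows of the table above; the rowvalue, variable and value names follow the question):

```python
import pandas as pd

# Made-up nested dict: outer keys become columns, inner keys become the row index.
my_nested_dict = {
    "93": {"A": 465, "C": 0, "D": 1},
    "94": {"A": 5, "C": 1, "D": 0},
}
df = pd.DataFrame.from_dict(my_nested_dict)

# Name the index, move it into a column, then melt every other column.
mdf = (
    df.rename_axis("rowvalue")
      .reset_index()
      .melt(id_vars="rowvalue", var_name="variable", value_name="value")
)
print(mdf)
```

This leaves df itself unchanged, since nothing is done in place.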
