I have a df as shown below
df:
Id gender age salary
1 m 27 100
2 m 26 100000
3 m 57 180
4 f 27 150
5 m 57 200
6 f 29 100
7 m 47 130
8 f 27 140
9 m 37 100
10 f 43 2000
From the above, I would like to replace every salary above the 80th percentile with the 80th-percentile value.
Expected output:
Id gender age salary
1 m 27 100
2 m 26 560
3 m 57 180
4 f 27 150
5 m 57 200
6 f 29 100
7 m 47 130
8 f 27 140
9 m 37 100
10 f 43 560
Let's try:
quantiles = df.salary.quantile(0.8)
df.loc[df.salary > quantiles, 'salary'] = quantiles
Output (with the default linear interpolation the 0.8 quantile of salary is 560; you would need interpolation='lower' to get 200):
Id gender age salary
0 1 m 27 100.0
1 2 m 26 560.0
2 3 m 57 180.0
3 4 f 27 150.0
4 5 m 57 200.0
5 6 f 29 100.0
6 7 m 47 130.0
7 8 f 27 140.0
8 9 m 37 100.0
9 10 f 43 560.0
In case you want to cap within each gender, compute the quantiles per group and apply the same assignment:
quantiles = df.groupby('gender')['salary'].transform('quantile', q=0.8)
df.loc[df.salary > quantiles, 'salary'] = quantiles
Output:
Id gender age salary
0 1 m 27 100
1 2 m 26 200
2 3 m 57 180
3 4 f 27 150
4 5 m 57 200
5 6 f 29 100
6 7 m 47 130
7 8 f 27 140
8 9 m 37 100
9 10 f 43 890
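For what it's worth, both cappings can also be written with Series.clip, which takes either a scalar or (as I understand it) an index-aligned Series as the upper bound; a minimal sketch:

# cap at the overall 0.8 quantile
df['salary'] = df['salary'].clip(upper=df['salary'].quantile(0.8))

# cap at the per-gender 0.8 quantile; transform returns caps on the same index
caps = df.groupby('gender')['salary'].transform('quantile', q=0.8)
df['salary'] = df['salary'].clip(upper=caps)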
I have a dataframe df with two columns, X and y.
In df['y'] I have integers from 1 to 10 inclusive. However, they have different frequencies:
df['y'].value_counts()
10 6645
9 6213
8 5789
7 4643
6 2532
5 1839
4 1596
3 878
2 815
1 642
I want to cut down my dataframe so that there is an equal number of occurrences of each label. Since the minimum frequency is 642, I only want to keep 642 randomly sampled rows of each class label, so that my new dataframe has 642 rows for each class label.
I thought stratified sampling might help, but stratifying only keeps the same percentage of each label, whereas I want all my labels to have the same frequency.
As an example of a dataframe:
import random
import pandas as pd

df = pd.DataFrame()
df['y'] = sum([[10]*6645, [9]*6213, [8]*5789, [7]*4643, [6]*2532, [5]*1839, [4]*1596, [3]*878, [2]*815, [1]*642], [])
df['X'] = [random.choice(list('abcdef')) for i in range(len(df))]
Use DataFrame.sample with groupby:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 11, 100), columns=['y'])
val_cnt = df['y'].value_counts()
min_sample = val_cnt.min()
print(min_sample)  # e.g. 7 for this random draw
print(df.groupby('y').apply(lambda s: s.sample(min_sample)))
Output
y
y
1 68 1
8 1
82 1
17 1
99 1
31 1
6 1
2 55 2
15 2
81 2
22 2
46 2
13 2
58 2
3 2 3
30 3
84 3
61 3
78 3
24 3
98 3
4 51 4
86 4
52 4
10 4
42 4
80 4
53 4
5 16 5
87 5
... ..
6 26 6
18 6
7 56 7
4 7
60 7
65 7
85 7
37 7
70 7
8 93 8
41 8
28 8
20 8
33 8
64 8
62 8
9 73 9
79 9
9 9
40 9
29 9
57 9
7 9
10 96 10
67 10
47 10
54 10
97 10
71 10
94 10
[70 rows x 1 columns]
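As a side note, on pandas 1.1 or newer, GroupBy.sample does the per-group sampling directly, without the lambda; a minimal sketch reusing min_sample from above:

balanced = df.groupby('y').sample(n=min_sample, random_state=0)
print(balanced)

Unlike the apply version, this keeps the original flat index rather than adding y as an index level.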
My code
import pandas as pd
import numpy as np
series = pd.read_csv('o1.csv', header=0)
s1 = series
s2 = series
s1['userID'] = series['userID'] + 5
s1['adID'] = series['adID'] + 3
s2['userID'] = s1['userID'] + 5
s2['adID'] = series['adID'] + 4
r1=series.append(s1)
r2=r1.append(s2)
print(r2)
I got something wrong; now the columns are exactly the same.
Output
userID gender adID rating
0 11 m 107 50
1 11 m 108 100
2 11 m 109 0
3 12 f 107 50
4 12 f 108 100
5 13 m 109 62
6 13 m 114 28
7 13 m 108 36
8 12 f 109 74
9 12 f 114 100
10 14 m 108 62
11 14 m 109 28
12 15 f 116 50
13 15 f 117 100
0 11 m 107 50
1 11 m 108 100
2 11 m 109 0
I didn't want my series columns to be changed.
Why did it happen?
How do I change this?
Do I need to use iloc?
IIUC, you need copy if you want a new DataFrame object:
s1 = series.copy()
s2 = series.copy()
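The underlying issue is that s1 = series does not copy anything; it only binds another name to the same object, so mutating s1 also mutates series. A quick check:

s1 = series
print(s1 is series)   # True - both names point at the same object
s1 = series.copy()
print(s1 is series)   # False - an independent copy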
Sample:
print (df)
userID gender adID rating
0 11 m 107 50
1 11 m 108 100
2 11 m 109 0
s1 = df.copy()
s2 = df.copy()
s1['userID'] = df['userID'] + 5
s1['adID'] = df['adID'] + 3
s2['userID'] = s1['userID'] + 5
s2['adID'] = df['adID'] + 4
r1=df.append(s1)
r2=r1.append(s2)
print(r2)
userID gender adID rating
0 11 m 107 50
1 11 m 108 100
2 11 m 109 0
0 16 m 110 50
1 16 m 111 100
2 16 m 112 0
0 21 m 111 50
1 21 m 112 100
2 21 m 113 0
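One caveat worth flagging: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions the two append calls would be written with pd.concat instead:

r2 = pd.concat([df, s1, s2])
print(r2)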
I have a pandas data frame of the format:
line_idno item_name \
sitename ts_placed order_idno
www.mattressesworld.co.uk 47 5242367 4112061 a
www.bedroomfurnitureworld.co.uk 47 5242295 4111977 b
5242295 4111979 a
5242295 4111978 v
5242295 4111980 a
www.mattressesworld.co.uk 47 5242300 4111986 b
www.bedroomfurnitureworld.co.uk 47 5242294 4111973 v
This has a three-level MultiIndex ('sitename', 'ts_placed', 'order_idno'),
where:
'ts_placed' represents weeks of the year,
'sitename' represents the name of the site,
'order_idno' is the order id number,
and five columns ('line_idno', 'item_name', 'item_qty', 'item_price', 'revenue').
The data was grouped on multiple levels with these calls:
grouped = data.groupby(level=['sitename','ts_placed','order_idno']).sum()
grouped0 = grouped.dropna()
result:
line_idno item_qty \
sitename ts_placed order_idno
www.bedroomfurnitureworld.co.uk 38 5156953 3994322 1
5156956 3994325 1
5157144 3994580 1
5157191 3994641 0
5157198 3994651 1
5157217 3994678 2
5157218 3994679 2
5157233 3994697 1
5157257 7989463 2
What I want to obtain is the weekly breakdown of the mean revenue for each site. In other words, the sum of all revenues for each ts_placed/sitename group divided by the row count for each ts_placed.
Here is a reproducible example.
import pandas as pd
import numpy as np
# simulate some artificial data
np.random.seed(1)
sites = ['www.{}.co.uk'.format(x) for x in 'AAA BBB CCC DDD EEE'.split()]
sitename = np.random.choice(sites, size=1000)
ts_placed = np.random.choice(np.arange(47, 53), size=1000)
order_idno = np.random.choice(np.arange(520000, 550000), size=1000)
item_name = np.random.choice('a b c d e f'.split(), size=1000)
line_idno = np.random.choice(np.arange(3960000, 4000000), size=1000)
item_qty = np.random.choice(np.arange(0, 10), size=1000)
item_price = np.random.choice(np.arange(1000, 10000), size=1000)
revenue = item_price * item_qty
data = pd.DataFrame(dict(sitename=sitename, ts_placed=ts_placed,
                         order_idno=order_idno, item_name=item_name,
                         line_idno=line_idno, item_qty=item_qty,
                         item_price=item_price, revenue=revenue))
data = data.set_index(['sitename', 'ts_placed', 'order_idno'])
Out[17]:
item_name item_price item_qty line_idno revenue
sitename ts_placed order_idno
www.DDD.co.uk 47 526418 a 4514 1 3989144 4514
www.EEE.co.uk 52 539155 d 4952 5 3965922 24760
www.AAA.co.uk 52 539417 d 8816 0 3988185 0
www.BBB.co.uk 49 523800 b 3340 3 3981971 10020
www.DDD.co.uk 48 521464 f 4402 6 3976820 26412
www.AAA.co.uk 49 521706 c 8436 5 3963275 42180
52 544452 c 7220 8 3992357 57760
www.BBB.co.uk 50 548184 d 3389 9 3976608 30501
www.EEE.co.uk 49 527830 f 8110 1 3998527 8110
521908 a 7292 4 3964393 29168
www.BBB.co.uk 47 527558 b 4945 6 3977830 29670
www.CCC.co.uk 47 549572 f 3350 5 3988678 16750
www.EEE.co.uk 48 522511 f 1865 0 3992356 0
www.CCC.co.uk 51 520156 e 4717 8 3974344 37736
www.EEE.co.uk 50 534951 b 3738 9 3978519 33642
... ... ... ... ... ...
www.CCC.co.uk 50 525279 e 5961 0 3980873 0
www.DDD.co.uk 48 539486 c 2028 4 3978442 8112
www.EEE.co.uk 48 543216 e 3721 6 3986919 22326
www.BBB.co.uk 51 525662 c 1264 7 3987129 8848
www.CCC.co.uk 52 546208 e 7287 4 3999828 29148
www.AAA.co.uk 48 544288 a 7708 1 3974546 7708
www.DDD.co.uk 52 538708 f 9080 7 3983499 63560
www.CCC.co.uk 48 536774 a 8971 2 3968092 17942
www.BBB.co.uk 48 528310 c 3284 2 3985896 6568
www.AAA.co.uk 49 549547 c 4265 4 3960981 17060
544394 c 2268 8 3982739 18144
52 540515 f 4476 5 3987786 22380
www.EEE.co.uk 50 540388 f 1226 5 3980156 6130
47 522633 f 4185 5 3964986 20925
www.CCC.co.uk 49 532710 c 7462 2 3984676 14924
[1000 rows x 5 columns]
# your custom apply function
def apply_func(group):
    avg_revenue = group.revenue.mean()
    count_unique_order = len(group.order_idno.unique())
    # or try this:
    # count_unique_order = group.order_idno.value_counts().count()
    return pd.Series({'avg_revenue': avg_revenue, 'count_unique_order': count_unique_order})

# use the customized apply function
data.reset_index(level='order_idno').dropna().groupby(level=['sitename', 'ts_placed']).apply(apply_func)
Out[46]:
avg_revenue count_unique_order
sitename ts_placed
www.AAA.co.uk 47 23501.8158 10
48 23003.9355 10
49 24254.1212 10
50 23254.6410 10
51 19173.8966 10
52 23845.6786 10
www.BBB.co.uk 47 26136.0882 10
48 23007.3929 9
49 30605.2857 10
50 19530.3871 10
51 21768.6667 9
52 28387.5455 10
www.CCC.co.uk 47 28917.3448 9
48 23659.3488 10
49 26209.0625 9
50 22929.2564 10
51 23474.2857 9
52 22123.3429 10
www.DDD.co.uk 47 27176.2778 10
48 24530.6154 10
49 23601.8710 9
50 27749.2162 10
51 26816.0000 9
52 29910.5455 10
www.EEE.co.uk 47 27270.6471 10
48 23498.0789 10
49 26682.4250 10
50 24524.4400 10
51 15635.2500 10
52 20917.2500 10
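For reference, the same table can be produced without a custom apply by using named aggregation, available since pandas 0.25; a sketch on the same data:

(data.reset_index(level='order_idno')
     .dropna()
     .groupby(level=['sitename', 'ts_placed'])
     .agg(avg_revenue=('revenue', 'mean'),
          count_unique_order=('order_idno', 'nunique')))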
I have the following dataframe:
In [4]:
df
Out[4]:
Symbol Date Strike C/P Bid Ask
0 GS 6/15/2015 200 c 5 72
1 GS 6/15/2015 200 p 5 72
2 GS 6/15/2015 210 c 15 0
3 GS 6/15/2015 210 p 15 54
4 GS 7/15/2015 200 c 20 50
5 GS 7/15/2015 200 p 20 0
6 GS 7/15/2015 210 c 4 90
7 GS 7/15/2015 210 p 4 90
8 IBM 6/15/2015 150 c 12 27
9 IBM 6/15/2015 150 p 12 0
10 IBM 6/15/2015 160 c 1 58
11 IBM 6/15/2015 160 p 1 3
12 IBM 7/15/2015 120 c 13 39
13 IBM 7/15/2015 120 p 13 39
14 IBM 7/15/2015 130 c 4 45
15 IBM 7/15/2015 130 p 4 45
and wish to filter out both the c and p rows for a given strike if either of them has an Ask of 0, as below:
Symbol Date Strike Call/Put Bid Ask yminx
GS 6/15/2015 200 c 5 72 90
GS 6/15/2015 200 p 5 72 90
GS 7/15/2015 210 c 4 90 90
GS 7/15/2015 210 p 4 90 90
IBM 6/15/2015 160 c 1 58 58
IBM 6/15/2015 160 p 1 3 58
IBM 7/15/2015 120 c 13 39 58
IBM 7/15/2015 120 p 13 39 58
IBM 7/15/2015 130 c 4 45 58
IBM 7/15/2015 130 p 4 45 58
I can filter rows whose Ask is 0 and remove them with:
df = df[df.Ask != 0]
but I cannot figure out how to remove the other row that has the same symbol/date/strike combination but a non-zero ask.
Any help would be greatly appreciated.
To filter out some rows, we need the 'filter' function instead of 'apply'.
by = df.groupby(['Symbol', 'Date', 'Strike'])
# these are used as filter functions; each returns a boolean.
# pandas' groupby.filter() is smart enough to keep every group
# for which the function returns True.
def equal_to_45(group):
    # return True if either Call or Put has an Ask of 45
    return any(group.Ask.values == 45)

def keep_geq_45(group):
    # return True if both Call and Put have an Ask greater than or equal to 45;
    # that is equivalent to deleting all entries with an Ask below 45
    return all(group.Ask.values >= 45)

# this time, use the filter function instead of apply
by.filter(equal_to_45)
Out[242]:
Symbol Date Strike C/P Bid Ask
14 IBM 2015-07-15 130 c 4 45
15 IBM 2015-07-15 130 p 4 45
by.filter(keep_geq_45)
Out[243]:
Symbol Date Strike C/P Bid Ask
0 GS 2015-06-15 200 c 5 72
1 GS 2015-06-15 200 p 5 72
6 GS 2015-07-15 210 c 4 90
7 GS 2015-07-15 210 p 4 90
14 IBM 2015-07-15 130 c 4 45
15 IBM 2015-07-15 130 p 4 45
>>> mask = df.groupby(['Symbol', 'Date', 'Strike'])['Ask'].transform('all')
>>> df[~mask]
Symbol Date Strike C/P Bid Ask
2 GS 6/15/2015 210 c 15 0
3 GS 6/15/2015 210 p 15 54
4 GS 7/15/2015 200 c 20 50
5 GS 7/15/2015 200 p 20 0
8 IBM 6/15/2015 150 c 12 27
9 IBM 6/15/2015 150 p 12 0
The mask is True wherever every Ask in the group is non-zero (0 is falsy), so to remove these rows keep df[mask].
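Applying the same filter idea directly to the original question (drop a Symbol/Date/Strike group whenever any of its Ask values is 0), a one-line sketch with the grouped object by from the first answer:

by.filter(lambda g: (g.Ask != 0).all())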
I converted a nested dictionary to a Pandas DataFrame, which I want to use to create a heatmap.
The conversion from the nested dictionary is simple:
>>>df = pandas.DataFrame.from_dict(my_nested_dict)
>>>df
93 94 95 96 97 98 99 100 100A 100B ... 100M 100N 100O 100P 100Q 100R 100S 101 102 103
A 465 5 36 36 28 24 25 30 28 32 ... 28 19 16 15 4 4 185 2 7 3
C 0 1 2 0 6 10 8 16 23 17 ... 9 5 6 3 4 2 3 3 0 1
D 1 0 132 6 17 22 17 25 21 25 ... 12 16 21 7 5 18 2 1 296 0
E 4 0 45 10 16 12 10 15 17 18 ... 4 9 7 10 5 6 4 3 129 0
F 1 0 4 17 14 11 8 11 24 9 ... 17 8 8 12 7 3 1 98 0 1
G 2 10 77 55 71 52 65 39 37 45 ... 46 65 23 9 18 171 141 2 31 0
H 0 5 25 12 18 8 12 7 10 6 ... 8 11 6 4 4 5 2 2 1 8
I 1 8 7 23 26 35 36 34 31 38 ... 19 7 2 37 7 3 0 3 2 26
K 0 42 3 24 5 15 17 11 6 8 ... 9 10 9 8 9 2 1 28 0 0
L 3 0 19 50 32 33 21 26 26 18 ... 19 44 122 11 10 7 5 17 2 5
M 0 1 1 3 1 13 9 12 12 8 ... 20 3 1 1 0 1 0 191 0 0
N 0 5 3 12 8 15 12 13 21 9 ... 18 10 10 11 12 26 3 0 5 1
P 1 1 19 50 39 47 42 43 39 33 ... 48 35 15 16 59 2 13 6 0 160
Q 0 2 16 15 12 13 10 13 16 5 ... 11 6 3 11 4 1 0 1 6 28
R 0 380 17 66 54 41 51 32 24 29 ... 43 44 16 17 14 6 2 126 4 5
S 14 18 27 42 55 37 41 42 45 70 ... 47 31 64 14 42 18 8 3 1 5
T 4 13 17 32 29 37 33 32 30 38 ... 87 79 19 125 96 11 11 7 7 3
V 4 9 36 24 39 40 35 45 42 52 ... 20 12 12 9 8 5 0 6 7 209
W 0 0 1 6 6 8 4 7 7 9 ... 6 6 1 1 1 1 27 1 0 0
X 0 0 0 0 0 0 0 0 0 0 ... 0 4 0 0 0 0 0 0 0 0
Y 0 0 13 17 24 27 44 47 41 31 ... 29 76 139 179 191 208 92 0 2 45
I like to use ggplot to make heat maps, and the heatmap would just be this data frame. However, the dataframes needed for ggplot are shaped a little differently. I can use the pandas.melt function to get close, but I'm missing the row labels.
>>>mdf = pandas.melt(df)
>>>mdf
variable value
0 93 465
1 93 0
2 93 1
3 93 4
4 93 1
5 93 2
6 93 0
7 93 1
8 93 0
...
624 103 5
625 103 3
626 103 209
627 103 0
628 103 0
629 103 45
The easiest fix would be to add the amino acid row value, so the DataFrame looks like:
variable value rowvalue
0 93 465 A
1 93 0 C
2 93 1 D
3 93 4 E
4 93 1 F
5 93 2 G
6 93 0 H
7 93 1 I
8 93 0 K
That way I can take that dataframe and put it right into ggplot:
>>> from ggplot import *
>>> ggplot(new_df,aes("variable","rowvalue")) + geom_tile(fill="value")
would produce a beautiful heatmap. How do I manipulate the nested-dictionary dataframe in order to get the dataframe shown at the end? If there is a more efficient way to do this, I'm open to suggestions, but I still want to use ggplot2.
Edit -
I found a solution, but it seems way too convoluted. Basically, I make the index into a column, then melt the data frame.
>>>df.reset_index(level=0, inplace=True)
>>>pandas.melt(df, id_vars=['index'])
index variable value
0 A 93 465
1 C 93 0
2 D 93 1
3 E 93 4
4 F 93 1
5 G 93 2
6 H 93 0
7 I 93 1
8 K 93 0
9 L 93 3
10 M 93 0
11 N 93 0
12 P 93 1
13 Q 93 0
14 R 93 0
15 S 93 14
16 T 93 4
If I understand your question properly, I think you can simply do the following:
mdf = pandas.melt(df)
# melt stacks the columns, so the row labels repeat once per column
mdf['rowvalue'] = list(df.index) * len(df.columns)
mdf
variable value rowvalue
0 93 465 A
1 93 0 C
2 93 1 D
3 93 4 E
4 93 1 F
5 93 2 G
6 93 0 H
7 93 1 I
8 93 0 K
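As another option, on pandas 1.1+ melt accepts ignore_index=False, which preserves the row labels so they can be recovered with reset_index; a minimal sketch:

mdf = df.melt(ignore_index=False).reset_index().rename(columns={'index': 'rowvalue'})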