Bar plot in Python for categorical data

I am trying to create a bar plot for one of the columns in my dataset.
The column name is Glucose, and I need a bar plot for three categorical ranges: 0-100, 101-150, 151-200.
X = dataset['Glucose']
X.head(20)
0 148
1 85
2 183
3 89
4 137
5 116
6 78
7 115
8 197
9 125
10 110
11 168
12 139
13 189
14 166
15 100
16 118
17 107
18 103
19 115
I am not sure which approach to follow. Could anyone please guide me?

You can use pd.cut (assuming X is a Series) with value_counts:
pd.cut(X,[0,100,150,200]).value_counts().plot.bar()

If you want to spell the bins out explicitly, you can also build an IntervalIndex and pass it as the bins argument:
bins = pd.IntervalIndex.from_tuples([(0, 100), (101, 150), (151, 200)])
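A minimal end-to-end sketch combining the two (assuming X is the Glucose Series from the question; note these intervals are open on the left, so a value of exactly 101 or 151 would not fall into any bin):
# count values per interval and plot, keeping bin order instead of sorting by frequency
pd.cut(X, bins).value_counts(sort=False).plot.bar()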

Related

concat result of apply in python

I am trying to apply a function to a column of a dataframe.
After getting multiple DataFrames as results, I want to concat them all into one.
Why does the first option work but not the second?
import numpy as np
import pandas as pd

def testdf(n):
    test = pd.DataFrame(np.random.randint(0, n*100, size=(n*3, 3)), columns=list('ABC'))
    test['index'] = n
    return test

test = pd.DataFrame({'id': [1, 2, 3, 4]})
testapply = test['id'].apply(func=testdf)

# option 1
pd.concat([testapply[0], testapply[1], testapply[2], testapply[3]])
# option 2
pd.concat([testapply])
pd.concat expects a sequence of pandas objects, but your second option passes a sequence containing a single pd.Series whose elements are DataFrames, so no concatenation happens - you just get that Series back as is. To fix your second approach, unpack the Series:
print(pd.concat([*testapply]))
A B C index
0 91 15 91 1
1 93 85 91 1
2 26 87 74 1
0 195 103 134 2
1 14 26 159 2
2 96 143 9 2
3 18 153 35 2
4 148 146 130 2
5 99 149 103 2
0 276 150 115 3
1 232 126 91 3
2 37 242 234 3
3 144 73 81 3
4 96 153 145 3
5 144 94 207 3
6 104 197 49 3
7 0 93 179 3
8 16 29 27 3
0 390 74 379 4
1 78 37 148 4
2 350 381 260 4
3 279 112 260 4
4 115 387 173 4
5 70 213 378 4
6 43 37 149 4
7 240 399 117 4
8 123 0 47 4
9 255 172 1 4
10 311 329 9 4
11 346 234 374 4
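Equivalently, converting the Series of DataFrames to a plain list before concatenating does the same thing and may read more clearly (a minor variation, not in the original answer):
# testapply is a Series whose values are DataFrames; pd.concat accepts any iterable of frames
pd.concat(testapply.tolist())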

Pandas: reverse a cumulative sum's process

I've got a simple pandas Series, like this one:
st
0 74
1 91
2 105
3 121
4 136
5 157
The data in this Series is the result of a cumulative sum, so I was wondering if a pandas function could "undo" the process and return a new Series like:
st result
0 74 74
1 91 17
2 105 14
3 121 16
4 136 15
5 157 21
result[0] = st[0], and afterwards result[i] = st[i] - st[i-1].
It seems very simple (and maybe I missed a post), but I didn't find anything...
Use Series.diff, replace the first missing value with the original via Series.fillna, and then cast to integers if necessary:
df['res'] = df['st'].diff().fillna(df['st']).astype(int)
print (df)
st res
0 74 74
1 91 17
2 105 14
3 121 16
4 136 15
5 157 21
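As a quick sanity check (assuming the df above), taking the cumulative sum of the result reproduces the original column:
# cumsum is the inverse of the diff/fillna step, so this should print True
print(df['res'].cumsum().equals(df['st']))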

How to reorder rows by moving multiple separated rows X rows down, in Python with either pandas or numpy

I have a very long dataframe with hundreds of rows. I want to select the rows with one keyword in one of the columns and move each of those rows 18 places down. Since there are too many, using reindex and doing it manually would take too long.
As an example, for this df I would like to move the rows with the word "Base" in column A three rows down, after "Three":
A B C
Base 572 55
One 654 196
Two 2 156
Three 154 123
Base 78 45
One 251 78
Two 5 56
Three 321 59
Base 48 45
One 5 12
Two 531 231
Three 51 123
So, I want it to look like:
A B C
One 654 196
Two 2 156
Three 154 123
Base 572 55
One 251 78
Two 5 56
Three 321 59
Base 78 45
One 5 12
Two 531 231
Three 51 123
Base 48 45
I am new to programming, so I would appreciate your help!
First create an extra dummy column to hold your sorting key. In this case, as far as I understood you:
order = ["One", "Two", "Three", "Base"]
df["sorting_key"] = df.groupby("A").cumcount().map(str) + ":" + df["A"].apply(order.index).map(str)
Then just sort by it:
df.sort_values("sorting_key")
Result:
A B C sorting_key
1 One 654 196 0:0
2 Two 2 156 0:1
3 Three 154 123 0:2
0 Base 572 55 0:3
5 One 251 78 1:0
6 Two 5 56 1:1
7 Three 321 59 1:2
4 Base 78 45 1:3
9 One 5 12 2:0
10 Two 531 231 2:1
11 Three 51 123 2:2
8 Base 48 45 2:3
Then, to reindex it and drop the dummy column:
df.sort_values("sorting_key").reset_index(drop=True).drop(columns="sorting_key")
Output:
A B C
0 One 654 196
1 Two 2 156
2 Three 154 123
3 Base 572 55
4 One 251 78
5 Two 5 56
6 Three 321 59
7 Base 78 45
8 One 5 12
9 Two 531 231
10 Three 51 123
11 Base 48 45
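One caveat for the real data, which has many more repetitions: the key above is a string, so once the cumcount reaches 10, "10:..." sorts before "2:...". Sorting on two numeric helper columns avoids that; a small sketch, assuming the same df and the same desired order:
# numeric block counter and numeric position within the block, so the sort is not lexicographic
df["block"] = df.groupby("A").cumcount()
df["within"] = df["A"].map({"One": 0, "Two": 1, "Three": 2, "Base": 3})
df = df.sort_values(["block", "within"]).drop(columns=["block", "within"]).reset_index(drop=True)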
Alternatively, you could do the following:
import numpy as np

# create a mask identifying the Base rows
mask = df.A.eq("Base")
# create index with non base values
non_base = df[~mask].reset_index(drop=True) # reset index
# create DataFrame with Base values
base = df[mask]
base.index = base.index + (3 - np.arange(len(base))) # change index to reflect new indices in result
# concat and sort by index
result = pd.concat([base, non_base], sort=True).sort_index().reset_index(drop=True)
print(result)
Output
A B C
0 One 654 196
1 Two 2 156
2 Three 154 123
3 Base 572 55
4 One 251 78
5 Two 5 56
6 Three 321 59
7 Base 78 45
8 One 5 12
9 Two 531 231
10 Three 51 123
11 Base 48 45
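A detail worth hedging on if you adapt this: when a moved row receives the same index as an existing row (as happens here), the tie is broken by the sort, and the default sort_index kind is not guaranteed to be stable. Requesting a stable sort keeps the concat order (base first) deterministic; a small variation on the line above, assuming a pandas/numpy version that accepts kind="stable" ("mergesort" is an older equivalent):
# identical to the original, but ties between equal indices preserve the concat order
result = pd.concat([base, non_base], sort=True).sort_index(kind="stable").reset_index(drop=True)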

python pandas: Grouping dataframe by ranges

I have a dataframe object with date and calltime columns.
I was trying to build a histogram based on the second column, e.g.:
df.groupby('calltime').head(10).plot(kind='hist', y='calltime')
Got the following histogram. The thing is that I want more detail for the first bar: the range 0-2500 itself is huge, and all the data is hidden there... Is there a possibility to split it into smaller ranges, e.g. of 50 or something like that?
UPD
date calltime
0 1491928756414930 4643
1 1491928756419607 166
2 1491928756419790 120
3 1491928756419927 142
4 1491928756420083 121
5 1491928756420217 109
6 1491928756420409 52
7 1491928756420476 105
8 1491928756420605 35
9 1491928756420654 120
10 1491928756420787 105
11 1491928756420907 93
12 1491928756421013 37
13 1491928756421062 112
14 1491928756421187 41
15 1491928756421240 122
16 1491928756421375 28
17 1491928756421416 158
18 1491928756421587 65
19 1491928756421667 108
20 1491928756421790 55
21 1491928756421858 145
22 1491928756422018 37
23 1491928756422068 63
24 1491928756422145 57
25 1491928756422214 43
26 1491928756422270 73
27 1491928756422357 90
28 1491928756422460 72
29 1491928756422546 77
... ... ...
9845 1491928759997328 670
9846 1491928759998255 372
9848 1491928759999116 659
9849 1491928759999897 369
9850 1491928760000380 746
9851 1491928760001245 823
9852 1491928760002189 634
9853 1491928760002869 335
9856 1491928760003929 4162
9865 1491928760009368 531
Use the bins argument:
s = pd.Series(np.abs(np.random.randn(100)) ** 3 * 2000)
s.hist(bins=20)
Or you can use pd.cut to produce your own custom bins.
pd.cut(
    s, [-np.inf] + [100 * i for i in range(10)] + [np.inf]
).value_counts(sort=False).plot.bar()
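Applied to your actual column, the same idea with fixed-width bins of 50 might look like this (a sketch; it assumes df is the frame shown in the update):
import numpy as np

# bin edges every 50 units, from 0 up to the largest calltime
edges = np.arange(0, df['calltime'].max() + 50, 50)
df['calltime'].plot.hist(bins=edges)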

Row-wise outlier detection in python

I have the CSV data as follows:
A_ID P_ID 1429982904 1430370002 1430974801 1431579602 1432184403 1432789202 1435208402 1435308653
11Jgipc qjMakF 364 365 363 363 364 364 364 367
11Jgipc qxL8FJ 18 18 18 18 18 18 18 18
11Jgipc r0Bpnt 40 40 41 41 41 42 42 42
11Jgipc roLk4N 140 140 143 143 146 147 147 149
11Jgipc tOudhM 12 13 13 13 13 13 14 14
11Jgipc u-x6o8 678 678 688 688 689 690 692 695
11Jgipc u5HHmV 1778 1785 1811 1811 1819 1826 1834 1836
11Jgipc ufrVoP 67 67 67 67 67 67 67 67
11Jgipc vRqMK4 36 36 34 34 34 34 34 34
11Jgipc wbdj-C 31 33 35 35 36 36 36 37
11Jgipc xtRiw3 6 6 6 6 6 6 6 6
What I want to do is find outliers in each row.
About the data:
The column headers apart from A_ID and P_ID are timestamps. So for each pair of A_ID and P_ID (i.e. in a row), a set of values is present. Each row can therefore be considered a time series.
Expected Output:
For each row, probably tuple(s) of the form [(A_ID, P_ID): (Value, ColumnHeader), .....]
What I have tried:
I have tried as per the suggestions given in this solution.
The simplest solution, finding the mean and standard deviation first and then flagging values more than K standard deviations above the mean (roughly the sketch shown below), did not work, because the appropriate value of K differs for each row.
Even the moving-average method does not seem appropriate here, because the constraint would differ for every row.
Manually setting such a constraint is not an option, as the number of rows is large, and so is the number of files I want to find outliers in.
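For reference, a minimal sketch of that row-wise K-sigma rule (the file name, the value of K, and the stacking step at the end are assumptions for illustration, not part of the question):
import pandas as pd

df = pd.read_csv("data.csv", sep=r"\s+")    # hypothetical file name and separator
values = df.drop(columns=["A_ID", "P_ID"])  # keep only the timestamp columns
K = 2                                       # the constant that, per the question, differs by row

row_mean = values.mean(axis=1)
row_std = values.std(axis=1)
# flag values more than K standard deviations above the row mean
outlier_mask = values.gt(row_mean + K * row_std, axis=0)
# (row, timestamp) -> value pairs for the flagged cells
outliers = values.where(outlier_mask).stack()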
Options that might be better, as per my understanding:
Using scikit-learn - "Outlier detection with several methods".
If so, how can I do it?
Any other specific package? Maybe in pandas? If so, how can I do it?
Any example, help or suggestion would be much appreciated.
