Row-wise outlier detection in Python

I have the CSV data as follows:
A_ID P_ID 1429982904 1430370002 1430974801 1431579602 1432184403 1432789202 1435208402 1435308653
11Jgipc qjMakF 364 365 363 363 364 364 364 367
11Jgipc qxL8FJ 18 18 18 18 18 18 18 18
11Jgipc r0Bpnt 40 40 41 41 41 42 42 42
11Jgipc roLk4N 140 140 143 143 146 147 147 149
11Jgipc tOudhM 12 13 13 13 13 13 14 14
11Jgipc u-x6o8 678 678 688 688 689 690 692 695
11Jgipc u5HHmV 1778 1785 1811 1811 1819 1826 1834 1836
11Jgipc ufrVoP 67 67 67 67 67 67 67 67
11Jgipc vRqMK4 36 36 34 34 34 34 34 34
11Jgipc wbdj-C 31 33 35 35 36 36 36 37
11Jgipc xtRiw3 6 6 6 6 6 6 6 6
What I want to do is find outliers in each row.
About the data:
The column headers apart from A_ID and P_ID are timestamps. So for each pair of A_ID and P_ID (i.e. in a row), a set of values is present, and each row can be considered a time series.
Expected Output:
For each row, probably the tuple(s) in the form [(A_ID, P_ID): (Value, ColumnHeader), ...]
What I have tried:
I have tried the suggestions given in this solution.
The simplest approach of computing the mean and standard deviation first, then flagging values more than K standard deviations above the mean, did not work because the appropriate value of K differs for each row.
Even the moving-average method seems inappropriate for this case, because the constraint would differ for every row.
Manually setting such a constraint is not an option, as the number of rows is large, and so is the number of such files I want to find outliers in.
Options that could be better, as far as I understand:
Using scikit-learn - "Outlier detection with several methods".
If yes, how can I do it?
Any other specific package, maybe pandas? If so, how can I do it?
Any example, help or suggestion would be much appreciated.
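Since no single K works across rows, one common trick is a robust per-row rule based on the median absolute deviation (MAD). The sketch below is not from the question: the 3-MAD threshold is an illustrative assumption, and one sample value is changed to 100 so an outlier actually appears. The output follows the [(A_ID, P_ID): (Value, ColumnHeader)] shape asked for above.

```python
import pandas as pd

# Two rows from the question's data, with one value perturbed to 100
# (illustrative only) so that an outlier exists.
df = pd.DataFrame({
    "A_ID": ["11Jgipc", "11Jgipc"],
    "P_ID": ["qjMakF", "r0Bpnt"],
    "1429982904": [364, 40],
    "1430370002": [365, 40],
    "1430974801": [363, 41],
    "1431579602": [363, 100],
})

values = df.drop(columns=["A_ID", "P_ID"])
med = values.median(axis=1)
# Median absolute deviation per row; robust against the very outliers we seek.
mad = values.sub(med, axis=0).abs().median(axis=1)
# Flag values deviating more than 3 MADs from the row median;
# clip avoids a zero threshold on constant rows.
outliers = values.sub(med, axis=0).abs().gt(3 * mad.clip(lower=1e-9), axis=0)

result = [((df.loc[i, "A_ID"], df.loc[i, "P_ID"]), (df.loc[i, c], c))
          for i, row in outliers.iterrows() for c in row[row].index]
print(result)
```

Here only the perturbed 100 in the second row is flagged; the small drift in the first row stays within 3 MADs.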

Related

How to compare two dataframe rows by rank

I have this dataframe (this is only a part of it):
replicate N_offspring group survival rank offs rank sur
27 H-CC-81 339 CCC 87 7 13
28 H-CC-82 285 CCC 89 16 12
29 H-CC-83 261 CCC 82 18 19
30 H-CC-84 312 CCC 108 12 5
31 H-CC-85 205 CCC 84 26 15
32 H-CC-86 153 CCC 59 28 27
I want to run a test on the 'N_offspring' and 'survival' columns based on each of their separate ranks ('rank offs', 'rank sur').
For example, the 'N_offspring' value with 'rank offs' = 20 will go against the 'survival' value with 'rank sur' = 20.
I used sort_values on the two groups and then compared the rows that I needed:
def spearman_group_test(group):
    group_value = the_dict[group]
    rank_by_off = group_value.sort_values(by=['rank offspring'])
    rank_by_sur = group_value.sort_values(by=['rank survival'])
    s_test = stats.spearmanr(rank_by_off['N_offspring'], rank_by_sur['survival'])
    print(group, s_test)

for keys in the_dict:
    spearman_group_test(keys)
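A self-contained version of the idea above; the group dictionary here is a toy stand-in built from the table shown in the question, and the column names follow the table's "rank offs" / "rank sur" headers:

```python
import pandas as pd
from scipy import stats

# Toy stand-in for the question's data: one group, columns as in the table.
the_dict = {
    "CCC": pd.DataFrame({
        "N_offspring": [339, 285, 261, 312, 205, 153],
        "survival":    [87, 89, 82, 108, 84, 59],
        "rank offs":   [7, 16, 18, 12, 26, 28],
        "rank sur":    [13, 12, 19, 5, 15, 27],
    })
}

def spearman_group_test(group):
    gv = the_dict[group]
    # Align the two columns by their respective ranks before correlating.
    by_off = gv.sort_values(by="rank offs")
    by_sur = gv.sort_values(by="rank sur")
    rho, p = stats.spearmanr(by_off["N_offspring"].to_numpy(),
                             by_sur["survival"].to_numpy())
    return rho, p

for key in the_dict:
    print(key, spearman_group_test(key))
```

With this toy data both rank-sorted columns happen to be monotonically decreasing, so the correlation comes out as 1.0.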

Optimizing the rewriting of the range values in a column into separate rows

I have a dataframe clothes_acc with column shoe_size containing values like:
index shoe_size
134 37-38
963 43-45
968 39-42
969 43-45
970 37-39
What I want to do is expand each range into a separate row per size. So I would get:
index shoe_size
134 37
134 38
963 43
963 44
963 45
968 39
968 40
968 41
968 42
...
Currently, I have the following code, which works fine except that it is very slow for a dataframe with 500k rows. (clothes_acc actually contains other values in the column that are not relevant here, which is why I take a subset of the dataframe with the mentioned values and save it in the tmp variable.)
for i, row in tqdm(tmp.iterrows(), total=tmp.shape[0]):
    clothes_acc = clothes_acc.drop([i])
    spl = [int(s) for s in row['shoe_size'].split('-')]
    for j in range(spl[0], spl[1] + 1):
        replicate = row.copy()
        replicate['shoe_size'] = str(j)
        clothes_acc = clothes_acc.append(replicate)
clothes_acc.reset_index(drop=True, inplace=True)
Could anyone please suggest an improvement?
Convert the string range to a list of integer sizes and call explode():
df['shoe_size'] = df.apply(
    lambda x: [i for i in range(int(x['shoe_size'].split('-')[0]),
                                int(x['shoe_size'].split('-')[1]) + 1)],
    axis=1)
df = df.explode(column='shoe_size')
For example, if df is:
df = pd.DataFrame({
'shoe_size': ['37-38', '43-45', '39-42', '43-45', '37-39']
})
... this will give the following result:
shoe_size
0 37
0 38
1 43
1 44
1 45
2 39
2 40
2 41
2 42
3 43
3 44
3 45
4 37
4 38
4 39
One option (more memory intensive) is to extract the bounds of the ranges, merge on all possible values and then filter to where the merged value is between the range. This will work okay when the shoe_sizes overlap for many of the products so that the cross join isn't insanely huge.
import numpy as np
import pandas as pd

# Bring ranges over to df
ranges = (clothes_acc['shoe_size'].str.split('-', expand=True)
          .apply(pd.to_numeric)
          .rename(columns={0: 'lower', 1: 'upper'}))
clothes_acc = pd.concat([clothes_acc, ranges], axis=1)
#   index shoe_size  lower  upper
# 0   134     37-38     37     38
# 1   963     43-45     43     45
# 2   968     39-42     39     42
# 3   969     43-45     43     45
# 4   970     37-39     37     39
vals = pd.DataFrame({'shoe_size': np.arange(clothes_acc.lower.min(),
                                            clothes_acc.upper.max() + 1)})
res = (clothes_acc.drop(columns='shoe_size')
       .merge(vals, how='cross')
       .query('lower <= shoe_size <= upper')
       .drop(columns=['lower', 'upper']))
print(res)
print(res)
index shoe_size
0 134 37
1 134 38
15 963 43
16 963 44
17 963 45
20 968 39
21 968 40
22 968 41
23 968 42
33 969 43
34 969 44
35 969 45
36 970 37
37 970 38
38 970 39

Bar plot in python for categorical data

I am trying to create a bar plot for one of the columns in my dataset.
The column name is Glucose, and I need a bar plot for three categories of values: 0-100, 101-150, 151-200.
X = dataset['Glucose']
X.head(20)
0 148
1 85
2 183
3 89
4 137
5 116
6 78
7 115
8 197
9 125
10 110
11 168
12 139
13 189
14 166
15 100
16 118
17 107
18 103
19 115
I am not sure which approach to follow. Could anyone please guide me?
You can use pd.cut (assuming X is a Series) together with value_counts:
pd.cut(X, [0, 100, 150, 200]).value_counts().plot.bar()
Or build the three requested bins explicitly with an IntervalIndex and pass them to pd.cut:
bins = pd.IntervalIndex.from_tuples([(0, 100), (101, 150), (151, 200)])
pd.cut(X, bins).value_counts().plot.bar()
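A runnable sketch of the pd.cut approach, using the first few Glucose values shown above (the plotting call is left as a comment so the counting step stands on its own):

```python
import pandas as pd

# First ten Glucose values from the question
X = pd.Series([148, 85, 183, 89, 137, 116, 78, 115, 197, 125], name="Glucose")

# Bin into the three requested categories and count;
# sort=False keeps the bins in interval order rather than by frequency.
counts = pd.cut(X, [0, 100, 150, 200]).value_counts(sort=False)
print(counts)
# counts.plot.bar() would then draw the bar chart
```

For these ten values the bins (0, 100], (100, 150], (150, 200] hold 3, 5, and 2 readings respectively.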

How to read the custom table in pandas which has number string number number?

I have been trying to read a custom table in pandas but keep getting errors.
Here is the outline of the table:
number string number number
there is only one whitespace between two words
a word is either a number or an English word
there are no NaNs
filename: station.tsv
794 Kissee Mills MO 140 73
824 Loma Mar CA 49 131
603 Sandy Hook CT 72 148
478 Tipton IN 34 98
619 Arlington CO 75 93
711 Turner AR 50 101
839 Slidell LA 85 152
411 Negreet LA 99 105
588 Glencoe KY 46 136
665 Chelsea IA 99 60
957 South El Monte CA 74 80
Note that the row `957 South El Monte CA 74 80` is actually the 33rd row of my data.
If it were only the 11th row, pandas gives no error, but when it appears as the nth row for large n it raises an error.
My attempt
df = pd.read_csv('station.tsv', header=None, sep=' ')
ParserError: Error tokenizing data.
C error: Expected 7 fields in line 33, saw 8
Question
Is there a way to parse the data with a regex, something like:
regexp = r'(\d+)\s+(\w+)\s+(\d+)\s+(\d+)'
to read the text data and build an array from it? I expect to use NumPy, pandas, or any other Python library for this.
You can specify a delimiter that is either a space not preceded by a letter, (?<![a-zA-Z])\s, or (|) a space followed by a digit, \s(?=\d).
sep = r'(?<![a-zA-Z])\s|\s(?=\d)'
df = pd.read_csv('station.tsv', engine='python', sep=sep, header=None)
0 1 2 3
0 794 Kissee Mills MO 140 73
1 824 Loma Mar CA 49 131
2 603 Sandy Hook CT 72 148
3 478 Tipton IN 34 98
4 619 Arlington CO 75 93
5 711 Turner AR 50 101
6 839 Slidell LA 85 152
7 411 Negreet LA 99 105
8 588 Glencoe KY 46 136
9 665 Chelsea IA 99 60
10 957 South El Monte CA 74 80
df.dtypes
#0 int64
#1 object
#2 int64
#3 int64
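If you prefer the regex route from the question, a per-line pattern with re.findall also works; the sketch below adapts the question's pattern with a lazy middle group so multi-word names survive, and the column names are assumptions for illustration. As with the sep-based answer, the state code stays attached to the name.

```python
import re
import pandas as pd

# A few lines from station.tsv, inlined for a self-contained example
text = """794 Kissee Mills MO 140 73
824 Loma Mar CA 49 131
957 South El Monte CA 74 80"""

# number, then the name (words plus state code, matched lazily),
# then the two trailing numbers; anchored per line via re.M
pattern = re.compile(r"^(\d+)\s+(.+?)\s+(\d+)\s+(\d+)$", re.M)
rows = [(int(a), name, int(b), int(c))
        for a, name, b, c in pattern.findall(text)]
df = pd.DataFrame(rows, columns=["id", "station", "n1", "n2"])
print(df)
```

Because the pattern is anchored at both ends of each line, the lazy group absorbs exactly the words between the leading and trailing numbers, so "South El Monte CA" parses as one field.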

python pandas: Grouping dataframe by ranges

I have a dataframe object with date and calltime columns.
I was trying to build a histogram based on the second column, e.g.:
df.groupby('calltime').head(10).plot(kind='hist', y='calltime')
Got the following:
The thing is that I want more detail for the first bar. The range 0-2500 is huge and all the data is hidden there... Is there a way to group by a smaller range, e.g. 50 or something like that?
UPD
date calltime
0 1491928756414930 4643
1 1491928756419607 166
2 1491928756419790 120
3 1491928756419927 142
4 1491928756420083 121
5 1491928756420217 109
6 1491928756420409 52
7 1491928756420476 105
8 1491928756420605 35
9 1491928756420654 120
10 1491928756420787 105
11 1491928756420907 93
12 1491928756421013 37
13 1491928756421062 112
14 1491928756421187 41
15 1491928756421240 122
16 1491928756421375 28
17 1491928756421416 158
18 1491928756421587 65
19 1491928756421667 108
20 1491928756421790 55
21 1491928756421858 145
22 1491928756422018 37
23 1491928756422068 63
24 1491928756422145 57
25 1491928756422214 43
26 1491928756422270 73
27 1491928756422357 90
28 1491928756422460 72
29 1491928756422546 77
... ... ...
9845 1491928759997328 670
9846 1491928759998255 372
9848 1491928759999116 659
9849 1491928759999897 369
9850 1491928760000380 746
9851 1491928760001245 823
9852 1491928760002189 634
9853 1491928760002869 335
9856 1491928760003929 4162
9865 1491928760009368 531
Use the bins argument:
s = pd.Series(np.abs(np.random.randn(100)) ** 3 * 2000)
s.hist(bins=20)
Or you can use pd.cut to produce your own custom bins.
pd.cut(
    s, [-np.inf] + [100 * i for i in range(10)] + [np.inf]
).value_counts(sort=False).plot.bar()
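Applied to the question's own data, fixed-width bins of 50 give the finer breakdown asked for; this sketch uses the first twenty calltime values from the sample above:

```python
import numpy as np
import pandas as pd

# First twenty calltime values from the question's sample
calltime = pd.Series([4643, 166, 120, 142, 121, 109, 52, 105, 35, 120,
                      105, 93, 37, 112, 41, 122, 28, 158, 65, 108])

# Fixed-width bins of 50, from 0 up past the maximum value
edges = np.arange(0, calltime.max() + 50, 50)
counts = pd.cut(calltime, edges).value_counts(sort=False)
# Show only non-empty bins to keep the output readable
print(counts[counts > 0])
```

The lone 4643 value lands in its own distant bin, while everything below 200 is split across bins of width 50 instead of hiding in one bar.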