How to compare two dataframe rows by rank - python

I have this dataframe (this is only a part of it):
replicate N_offspring group survival rank offs rank sur
27 H-CC-81 339 CCC 87 7 13
28 H-CC-82 285 CCC 89 16 12
29 H-CC-83 261 CCC 82 18 19
30 H-CC-84 312 CCC 108 12 5
31 H-CC-85 205 CCC 84 26 15
32 H-CC-86 153 CCC 59 28 27
I want to run a test on the 'N_offspring' and 'survival' columns based on each of their separate ranks ('rank offs', 'rank sur').
For example, the 'N_offspring' value whose 'rank offs' is 20 goes against the 'survival' value whose 'rank sur' is 20.

I used sort_values on the two groups and then compared the rows that I needed:
from scipy import stats

def spearman_group_test(group):
    group_value = the_dict[group]  # the_dict maps each group label to its sub-DataFrame
    rank_by_off = group_value.sort_values(by=['rank offspring'])
    rank_by_sur = group_value.sort_values(by=['rank survival'])
    s_test = stats.spearmanr(rank_by_off['N_offspring'], rank_by_sur['survival'])
    print(group, s_test)

for keys in the_dict:
    spearman_group_test(keys)
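For reference, here is a minimal self-contained sketch of the same idea on the sample rows above (the column names 'rank offs' and 'rank sur' are taken from the printed frame; each column is sorted by its own rank so that equal ranks line up):

import pandas as pd
from scipy import stats

df = pd.DataFrame({
    'replicate':   ['H-CC-81', 'H-CC-82', 'H-CC-83', 'H-CC-84', 'H-CC-85', 'H-CC-86'],
    'N_offspring': [339, 285, 261, 312, 205, 153],
    'group':       ['CCC'] * 6,
    'survival':    [87, 89, 82, 108, 84, 59],
    'rank offs':   [7, 16, 18, 12, 26, 28],
    'rank sur':    [13, 12, 19, 5, 15, 27],
})

# Pair the N_offspring value whose 'rank offs' is k with the survival
# value whose 'rank sur' is k, then run the Spearman test on the pairs.
by_off = df.sort_values('rank offs')['N_offspring'].to_numpy()
by_sur = df.sort_values('rank sur')['survival'].to_numpy()
print(stats.spearmanr(by_off, by_sur))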


Using Python, update the maximum value in each dataframe row with the sum of [column with maximum value] and [threshold column]

Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
11 40 30 20 100 110 5
21 60 70 80 55 57 8
32 12 43 57 87 98 9
41 99 23 45 65 78 12
This is the demo data frame. I want to pick the maximum for each row from the three countries (US, INDIA, GERMANY), add the Threshold value to that maximum, and write the result back into the dataframe.
Let's take an example:
max[US,INDIA,GERMANY] = max[US,INDIA,GERMANY] + threshold
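For the first row, for instance, the maximum of (US=40, INDIA=30, GERMANY=100) is GERMANY's 100, so GERMANY is updated to 100 + 5 = 105.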
After performing this, the dataframe will be updated as below:
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
11 40 30 20 105 110 5
21 60 78 80 55 57 8
32 12 43 57 96 98 9
41 111 23 45 65 78 12
I tried to achieve this using a for loop, but it takes too long to execute:
df_max = df_final[['US','INDIA','GERMANY']].idxmax(axis=1)
for ind in df_final.index:
    column = df_max[ind]
    df_final[column][ind] = df_final[column][ind] + df_final['Threshold'][ind]
Please help me with this. Looking forward to a good solution. Thanks in advance!
The first solution compares the maximal value per row with all values of the filtered columns, then multiplies the resulting mask by Threshold and adds it to the original columns:
cols = ['US','INDIA','GERMANY']
df_final[cols] += (df_final[cols].eq(df_final[cols].max(axis=1), axis=0)
                                 .mul(df_final['Threshold'], axis=0))
print(df_final)
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
0 11 40 30 20 105 110 5
1 21 60 78 80 55 57 8
2 32 12 43 57 96 98 9
3 41 111 23 45 65 78 12
Or use numpy: get the column names by idxmax, compare against an array built from the list cols, multiply, and add to the original columns:
import numpy as np

cols = ['US','INDIA','GERMANY']
df_final[cols] += ((np.array(cols) == df_final[cols].idxmax(axis=1).to_numpy()[:, None]) *
                   df_final['Threshold'].to_numpy()[:, None])
print(df_final)
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
0 11 40 30 20 105 110 5
1 21 60 78 80 55 57 8
2 32 12 43 57 96 98 9
3 41 111 23 45 65 78 12
The solutions differ if there are multiple maximum values per row: the first solution adds the threshold to all maxima, the second only to the first maximum.
print(df_final)
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
0 11 40 100 20 100 110 5 <- changed data: two maxima of 100 in this row
1 21 60 70 80 55 57 8
2 32 12 43 57 87 98 9
3 41 99 23 45 65 78 12
cols = ['US','INDIA','GERMANY']
df_final[cols] += (df_final[cols].eq(df_final[cols].max(axis=1), axis=0)
                                 .mul(df_final['Threshold'], axis=0))
print(df_final)
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
0 11 40 105 20 105 110 5
1 21 60 78 80 55 57 8
2 32 12 43 57 96 98 9
3 41 111 23 45 65 78 12
cols = ['US','INDIA','GERMANY']
df_final[cols] += ((np.array(cols) == df_final[cols].idxmax(axis=1).to_numpy()[:, None]) *
                   df_final['Threshold'].to_numpy()[:, None])
print(df_final)
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
0 11 40 105 20 100 110 5
1 21 60 78 80 55 57 8
2 32 12 43 57 96 98 9
3 41 111 23 45 65 78 12

Bar plot in python for categorical data

I am trying to create a bar plot for one of the columns in my dataset.
The column name is Glucose, and I need a bar plot for three categorical ranges: 0-100, 101-150, 151-200.
X = dataset['Glucose']
X.head(20)
0 148
1 85
2 183
3 89
4 137
5 116
6 78
7 115
8 197
9 125
10 110
11 168
12 139
13 189
14 166
15 100
16 118
17 107
18 103
19 115
Not sure which approach to follow. Could anyone please guide me?
You can use pd.cut (assuming X is a Series) with value_counts:
pd.cut(X, [0, 100, 150, 200]).value_counts().plot.bar()
Or build the bins explicitly; the intervals are right-closed, so (0, 100] catches readings up to 100 and (100, 150] catches 101-150:
bins = pd.IntervalIndex.from_tuples([(0, 100), (100, 150), (150, 200)])
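These bins can be passed straight to pd.cut; a minimal sketch, assuming X is the Glucose Series from the question:

# Bucket each reading, count per bucket (keeping bin order), and plot.
pd.cut(X, bins).value_counts(sort=False).plot.bar()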

Group by one column and find sum and max value for another in pandas

I have a dataframe like this:
Name id col1 col2 col3 cl4
PL 252 0 747 3 53
PL2 252 1 24 2 35
PL3 252 4 75 24 13
AD 889 53 24 0 95
AD2 889 23 2 0 13
AD3 889 0 24 3 6
BG 024 12 89 53 66
BG1 024 43 16 13 0
BG2 024 5 32 101 4
And now I need to group by id and, for columns col1 and col4, find the sum for each id and put it into a new column next to the parent column (example: col1(sum)). But for col2 and col3, find the max value.
Desired output:
Name id col1 col1(sum) col2 col2(max) col3 col3(max) col4 col4(sum)
PL 252 0 5 747 747 3 24 6 18
PL2 252 1 5 24 747 2 24 12 18
PL3 252 4 5 75 747 24 24 0 18
AD 889 53 76 24 24 95 95 23 33
AD2 889 23 76 2 24 13 95 5 33
AD3 889 0 76 24 24 6 95 5 33
BG 024 12 60 89 89 66 66 0 67
BG1 024 43 60 16 89 0 66 63 67
BG2 024 5 60 32 89 4 66 4 67
What is the easiest and fastest way to calculate this?
The most pandas-native way to do this is to use the .agg() method, which allows you to specify the aggregation function to apply per column (just as you would in SQL).
Sample from the documentation:
df.groupby('A').agg({'B': ['min', 'max'], 'C': 'sum'})
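Applied to this question's frame, that becomes (a sketch; note the last input column is spelled cl4):

df.groupby('id').agg({'col1': 'sum', 'col2': 'max', 'col3': 'max', 'cl4': 'sum'})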
You can use groupby/transform to create the required columns (the sums come from col1/cl4, the maxima from col2/col3):
df[['col1_sum', 'col4_sum']] = df.groupby('id')[['col1', 'cl4']].transform('sum')
df[['col2_max', 'col3_max']] = df.groupby('id')[['col2', 'col3']].transform('max')
  Name   id  col1  col2  col3  cl4  col1_sum  col4_sum  col2_max  col3_max
0   PL  252     0   747     3   53         5       101       747        24
1  PL2  252     1    24     2   35         5       101       747        24
2  PL3  252     4    75    24   13         5       101       747        24
3   AD  889    53    24     0   95        76       114        24         3
4  AD2  889    23     2     0   13        76       114        24         3
5  AD3  889     0    24     3    6        76       114        24         3
6   BG   24    12    89    53   66        60        70        89       101
7  BG1   24    43    16    13    0        60        70        89       101
8  BG2   24     5    32   101    4        60        70        89       101
You can use merge after a groupby and sum on id:
pd.merge(df, df.groupby("id").sum().reset_index(), on='id', how='outer')
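To get both the sums and the maxima attached in one merge, named aggregation (pandas 0.25+) is one option; a sketch using the column names from the question:

agg = df.groupby('id').agg(**{'col1(sum)': ('col1', 'sum'),
                              'col2(max)': ('col2', 'max'),
                              'col3(max)': ('col3', 'max'),
                              'col4(sum)': ('cl4', 'sum')}).reset_index()
df = df.merge(agg, on='id', how='left')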
I know this is messy but I like chaining so you can do something like this:
df = (df.groupby('id')
        .apply(lambda g: g.assign(
            col1_sum=g.col1.sum(),
            col2_max=g.col2.max())))
Basically, this applies a group-based assign to each group and then combines the results into a single DataFrame.
See https://pandas.pydata.org/pandas-docs/stable/api.html for details on each method.

python pandas: Grouping dataframe by ranges

I have a dataframe object with date and calltime columns, and was trying to build a histogram based on the second column, e.g.:
df.groupby('calltime').head(10).plot(kind='hist', y='calltime')
Got a histogram (image not shown). The thing is that I want more detail on the first bar: the 0-2500 range is huge, and all the data is hidden there. Is there a possibility to split it into smaller ranges, e.g. of width 50?
UPD: here is a sample of the data:
date calltime
0 1491928756414930 4643
1 1491928756419607 166
2 1491928756419790 120
3 1491928756419927 142
4 1491928756420083 121
5 1491928756420217 109
6 1491928756420409 52
7 1491928756420476 105
8 1491928756420605 35
9 1491928756420654 120
10 1491928756420787 105
11 1491928756420907 93
12 1491928756421013 37
13 1491928756421062 112
14 1491928756421187 41
15 1491928756421240 122
16 1491928756421375 28
17 1491928756421416 158
18 1491928756421587 65
19 1491928756421667 108
20 1491928756421790 55
21 1491928756421858 145
22 1491928756422018 37
23 1491928756422068 63
24 1491928756422145 57
25 1491928756422214 43
26 1491928756422270 73
27 1491928756422357 90
28 1491928756422460 72
29 1491928756422546 77
... ... ...
9845 1491928759997328 670
9846 1491928759998255 372
9848 1491928759999116 659
9849 1491928759999897 369
9850 1491928760000380 746
9851 1491928760001245 823
9852 1491928760002189 634
9853 1491928760002869 335
9856 1491928760003929 4162
9865 1491928760009368 531
Use bins:
import numpy as np
import pandas as pd

s = pd.Series(np.abs(np.random.randn(100)) ** 3 * 2000)
s.hist(bins=20)
Or you can use pd.cut to produce your own custom bins.
pd.cut(
    s, [-np.inf] + [100 * i for i in range(10)] + [np.inf]
).value_counts(sort=False).plot.bar()
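Applied to the question's calltime column with the requested width-50 buckets (a sketch, assuming df is the frame from the update; the open-ended last bin catches the outliers):

pd.cut(
    df['calltime'], list(range(0, 2501, 50)) + [np.inf]
).value_counts(sort=False).plot.bar()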

group the data by dates and find average in python

I have the following data:
Date Value
0 1/3/2014 778
1 1/4/2014 4554
2 1/5/2014 23
3 1/6/2014 767
4 1/7/2014 878
5 1/8/2014 678
6 1/9/2014 64
7 1/10/2014 344
8 1/11/2014 6576
9 1/12/2014 879
10 1/13/2014 5688
11 1/14/2014 688
12 1/15/2014 8799
13 1/16/2014 7899
14 1/17/2014 76
15 1/18/2014 868
16 1/19/2014 7976
17 1/20/2014 8679
18 1/21/2014 6976
19 1/22/2014 68
20 1/23/2014 754
21 1/24/2014 878
22 1/25/2014 9796
23 1/26/2014 57
24 1/27/2014 868
25 1/28/2014 868
26 1/29/2014 8778
27 1/30/2014 887
28 1/31/2014 765
29 2/1/2014 57
I would like to divide the data into groups of 15 consecutive days and find the average of the values. I have a naive way:
i = 15
j = 0
while i <= 30:
    X = data[j:i].mean()
    j = i
    i = i + 15
    print(X)
Is there a better way, say using groupby in pandas?
try this:
df['Date'] = pd.to_datetime(df['Date'])
# pd.Grouper(freq='15D') is the modern replacement for the removed pd.TimeGrouper
print(df.set_index('Date').groupby(pd.Grouper(freq='15D')).mean())
Output:
Value
Date
2014-01-03 2579.400000
2014-01-18 3218.333333
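On current pandas the same 15-day means can also be computed with resample (a sketch over the same df):

df['Date'] = pd.to_datetime(df['Date'])
print(df.set_index('Date')['Value'].resample('15D').mean())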
