How to get a stratified random sample of indices? - python

I have an array (pd.Series) of two values (A's and B's, for example).
y = pd.Series(['A','B','A','A','B','B','A','B','A','B','B'])
0 A
1 B
2 A
3 A
4 B
5 B
6 A
7 B
8 A
9 B
10 B
I want to get a random sample of indices from series, but half of the indices must correspond with an A, and the other half must correspond with a B.
For example
get_random_stratified_sample_of_indices(y=y, n=4)
[0, 1, 2, 4]
Indices 0 and 2 correspond with A's, and indices 1 and 4 correspond with B's.
Another example
get_random_stratified_sample_of_indices(y=y, n=6)
[1, 4, 5, 0, 2, 3]
The order of the returned list of indices doesn't matter but I need it to be even split between indices of A's and B's from the y array.
My plan was to first look at the indices of A's, then take a random sample (size=n/2) of the indices. And then repeat for B.
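A minimal sketch of that plan (assuming n is even and each class has at least n // 2 members; the seed parameter is only there for reproducibility):

```python
import numpy as np
import pandas as pd

def get_random_stratified_sample_of_indices(y, n, seed=None):
    """Sample n // 2 indices from each of the two classes in y."""
    rng = np.random.default_rng(seed)
    idx = []
    for label in y.unique():
        class_idx = y.index[y == label]          # indices of this class
        idx.extend(rng.choice(class_idx, size=n // 2, replace=False))
    return idx

y = pd.Series(['A', 'B', 'A', 'A', 'B', 'B', 'A', 'B', 'A', 'B', 'B'])
idx = get_random_stratified_sample_of_indices(y, 4, seed=0)
print(sorted(y[idx]))  # ['A', 'A', 'B', 'B']
```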

You can use groupby.sample:
N = 4
idx = (y
       .index.to_series()
       .groupby(y)
       .sample(n=N // len(y.unique()))
       .to_list()
       )
Output: [3, 8, 10, 1]
Check:
3 A
8 A
10 B
1 B
dtype: object

Here's one way to do it:
def get_random_stratified_sample_of_indices(s, n):
    mask = s == 'A'
    s1 = s[mask]
    s2 = s[~mask]
    m1 = n // 2
    m2 = m1 if n % 2 == 0 else m1 + 1
    i1 = s1.sample(m1).index.to_list()
    i2 = s2.sample(m2).index.to_list()
    return i1 + i2
Which could be used in this way:
y = pd.Series(['A','B','A','A','B','B','A','B','A','B','B'])
i = get_random_stratified_sample_of_indices(y, 5)
print(i)
print()
print(y[i])
Result:
[6, 2, 7, 10, 5]
6 A
2 A
7 B
10 B
5 B

I think you could use train_test_split from scikit-learn, setting its stratify parameter.
from sklearn.model_selection import train_test_split
import pandas as pd
y = (
    pd.Series(["A", "B", "A", "A", "B", "B", "A", "B", "A", "B", "B"])
    .to_frame("col")
    .assign(i=lambda xdf: xdf.index)
)
print(y)
# Prints:
#
# col i
# 0 A 0
# 1 B 1
# 2 A 2
# 3 A 3
# 4 B 4
# 5 B 5
# 6 A 6
# 7 B 7
# 8 A 8
# 9 B 9
# 10 B 10
print('\n')
# ===== Actual solution =====================================
a, b = train_test_split(y, test_size=0.5, stratify=y["col"])
# ===========================================================
print(a)
# Prints:
#
# col i
# 10 B 10
# 6 A 6
# 7 B 7
# 8 A 8
# 4 B 4
print('\n')
print(b)
# Prints:
#
# col i
# 3 A 3
# 9 B 9
# 2 A 2
# 1 B 1
# 5 B 5
# 0 A 0
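If only the indices are needed, train_test_split can also split a plain array of positions directly, which avoids building the helper DataFrame. Note that stratify keeps the class mix close to the original 5:6 A:B ratio rather than forcing an exact 50/50 split (a sketch; random_state is only for reproducibility):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

y = pd.Series(['A', 'B', 'A', 'A', 'B', 'B', 'A', 'B', 'A', 'B', 'B'])
n = 4
# Split an array of positions; train_size=n picks exactly n of them,
# stratified on the labels in y.
sample_idx, _ = train_test_split(np.arange(len(y)), train_size=n,
                                 stratify=y, random_state=0)
print(sorted(y[sample_idx]))
```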


Find local maxima or peaks (index) in a numeric series using numpy and pandas

Write a python program to find all the local maxima or peaks (indices) in a numeric series using numpy and pandas. A peak refers to a value surrounded by smaller values on both sides.
Note
Create a pandas Series from the given input.
Input format:
The first line of the input consists of a list of integers separated by spaces, which forms the pandas Series.
Output format:
Display the array of indices where peak values are present.
Sample testcase
input1
12 1 2 1 9 10 2 5 7 8 9 -9 10 5 15
output1
[2 5 10 12]
How to solve this problem?
import pandas as pd
a = "12 1 2 1 9 10 2 5 7 8 9 -9 10 5 15"
a = [int(x) for x in a.split(" ")]
# Classify each step as rising or falling
angles = []
for i in range(len(a)):
    if i != 0:
        if a[i] > a[i - 1]:
            angles.append('rise')
        else:
            angles.append('fall')
    else:
        angles.append('ignore')
# Mark positions where a fall immediately follows a rise
prev_val = "none"
counts = []
for s in angles:
    if s == "fall" and prev_val == "rise":
        prev_val = s
        counts.append(1)
    else:
        prev_val = s
        counts.append(0)
peaks_pd = pd.Series(counts).shift(-1).fillna(0).astype(int)
df = pd.DataFrame({
    'a': a,
    'peaks': peaks_pd
})
peak_vals = list(df[df['peaks'] == 1]['a'].index)
This could be improved further. The steps I followed:
First, determine for each step whether the series is rising or falling.
A peak is an index at which the values start falling after rising.
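Those two steps can also be vectorized with numpy in a few lines (a sketch of the same rise/fall idea, using np.diff and np.sign):

```python
import numpy as np

a = np.array([12, 1, 2, 1, 9, 10, 2, 5, 7, 8, 9, -9, 10, 5, 15])
# Sign of each consecutive step: +1 = rise, -1 = fall, 0 = flat.
steps = np.sign(np.diff(a))
# A peak is a position where a rise is immediately followed by a fall.
peaks = np.where((steps[:-1] > 0) & (steps[1:] < 0))[0] + 1
print(peaks)  # [ 2  5 10 12]
```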
Use:
data = [12, 1, 2, 1.1, 9, 10, 2.1, 5, 7, 8, 9.1, -9, 10.1, 5.1, 15]
s = pd.Series(data)
n = 3 # number of points to be checked before and after
import numpy as np
from scipy.signal import argrelextrema
local_max_index = argrelextrema(s.to_frame().to_numpy(), np.greater_equal, order=n)[0].tolist()
print (local_max_index)
[0, 5, 14]
local_max_index = s.index[(s.shift() <= s) & (s.shift(-1) <= s)].tolist()
print (local_max_index)
[2, 5, 10, 12]
local_max_index = s.index[s == s.rolling(n, center=True).max()].tolist()
print (local_max_index)
[2, 5, 10, 12]
EDIT: Solution for processing value in DataFrame:
df = pd.DataFrame({'Input': ["12 1 2 1 9 10 2 5 7 8 9 -9 10 5 15"]})
print (df)
Input
0 12 1 2 1 9 10 2 5 7 8 9 -9 10 5 15
s = df['Input'].iloc[[0]].str.split().explode().astype(int).reset_index(drop=True)
print (s)
0 12
1 1
2 2
3 1
4 9
5 10
6 2
7 5
8 7
9 8
10 9
11 -9
12 10
13 5
14 15
Name: Input, dtype: int32
local_max_index = s.index[(s.shift() <= s) & (s.shift(-1) <= s)].tolist()
print (local_max_index)
[2, 5, 10, 12]
df['output'] = [local_max_index]
print (df)
Input output
0 12 1 2 1 9 10 2 5 7 8 9 -9 10 5 15 [2, 5, 10, 12]

Create dynamic nested for loops

I have some arrays of m rows by 2 columns (like series of coordinates), and I want to automate my code so that I don't have to write a nested loop for every coordinate list. Here is my code; it runs well and gives the right coordinates, but I want to make the loop dynamic:
import numpy as np
A = np.array([[1,5,7,4,6,2,2,6,7,2],
              [2,8,2,9,3,9,8,5,6,2],
              [3,4,0,2,4,3,0,2,6,7],
              [1,5,7,3,4,5,2,7,9,7],
              [6,2,8,8,6,7,9,6,9,7],
              [0,2,0,3,3,5,2,3,5,5],
              [5,5,5,0,6,6,8,5,9,0],
              [0,5,7,6,0,6,9,9,6,7],
              [5,5,8,5,0,8,5,3,5,5],
              [0,0,6,3,3,3,9,5,9,9]])
number = 8292
number = np.asarray([int(i) for i in str(number)])  # split number into an array of digits
# the coordinates of every single value contained in the required number
coord1 = np.asarray(np.where(A == number[0])).T
coord2 = np.asarray(np.where(A == number[1])).T
coord3 = np.asarray(np.where(A == number[2])).T
coord4 = np.asarray(np.where(A == number[3])).T
coordinates = np.array([[0, 0]])  # initialize the array that will hold all the desired coordinates
solutions = 0  # initialize the counter for the number of solutions
for j in coord1:
    j = j.reshape(1, -1)
    for i in coord2:
        i = i.reshape(1, -1)
        if (i[0, 0] == j[0, 0] + 1 and i[0, 1] == j[0, 1]) or \
           (i[0, 0] == j[0, 0] - 1 and i[0, 1] == j[0, 1]) or \
           (i[0, 0] == j[0, 0] and i[0, 1] == j[0, 1] + 1) or \
           (i[0, 0] == j[0, 0] and i[0, 1] == j[0, 1] - 1):
            for ii in coord3:
                ii = ii.reshape(1, -1)
                if not np.array_equal(ii, j) and \
                   ((ii[0, 0] == i[0, 0] + 1 and ii[0, 1] == i[0, 1]) or
                    (ii[0, 0] == i[0, 0] - 1 and ii[0, 1] == i[0, 1]) or
                    (ii[0, 0] == i[0, 0] and ii[0, 1] == i[0, 1] + 1) or
                    (ii[0, 0] == i[0, 0] and ii[0, 1] == i[0, 1] - 1)):
                    for iii in coord4:
                        iii = iii.reshape(1, -1)
                        if not np.array_equal(iii, i) and \
                           ((iii[0, 0] == ii[0, 0] + 1 and iii[0, 1] == ii[0, 1]) or
                            (iii[0, 0] == ii[0, 0] - 1 and iii[0, 1] == ii[0, 1]) or
                            (iii[0, 0] == ii[0, 0] and iii[0, 1] == ii[0, 1] + 1) or
                            (iii[0, 0] == ii[0, 0] and iii[0, 1] == ii[0, 1] - 1)):
                            point = np.concatenate((j, i, ii, iii))
                            coordinates = np.append(coordinates, point, axis=0)
                            solutions += 1
coordinates = np.delete(coordinates, (0), axis=0)
import itertools
A = [1, 2, 3]
B = [4, 5, 6]
C = [7, 8, 9]
for (a, b, c) in itertools.product(A, B, C):
    print(a, b, c)
outputs:
1 4 7
1 4 8
1 4 9
1 5 7
1 5 8
1 5 9
1 6 7
1 6 8
1 6 9
2 4 7
2 4 8
2 4 9
2 5 7
2 5 8
2 5 9
2 6 7
2 6 8
2 6 9
3 4 7
3 4 8
3 4 9
3 5 7
3 5 8
3 5 9
3 6 7
3 6 8
3 6 9
See documentation for details.
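Applied to the coordinate problem above, the same product trick removes the fixed nesting: build one candidate-coordinate list per digit and filter the Cartesian product with the adjacency rule. This is a sketch; the no-revisit check only compares step k with step k-2, mirroring the array_equal tests in the question:

```python
import itertools
import numpy as np

A = np.array([[1,5,7,4,6,2,2,6,7,2],[2,8,2,9,3,9,8,5,6,2],[3,4,0,2,4,3,0,2,6,7],
              [1,5,7,3,4,5,2,7,9,7],[6,2,8,8,6,7,9,6,9,7],[0,2,0,3,3,5,2,3,5,5],
              [5,5,5,0,6,6,8,5,9,0],[0,5,7,6,0,6,9,9,6,7],[5,5,8,5,0,8,5,3,5,5],
              [0,0,6,3,3,3,9,5,9,9]])
digits = [int(c) for c in "8292"]
# One list of candidate (row, col) coordinates per digit.
coords = [list(zip(*np.where(A == d))) for d in digits]

def adjacent(p, q):
    # Horizontally or vertically adjacent cells.
    return abs(p[0] - q[0]) + abs(p[1] - q[1]) == 1

solutions = [path for path in itertools.product(*coords)
             if all(adjacent(path[k], path[k + 1]) for k in range(len(path) - 1))
             and all(path[k] != path[k - 2] for k in range(2, len(path)))]
print(len(solutions))
```

This works for a number of any length, since the nesting depth is now just the length of coords.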

Numpy: recode numeric array to which quintile each element belongs

I have a numeric vector a:
import numpy as np
a = np.random.rand(100)
I wish to get the vector (or any other vector) recoded so that each element is either 0, 1, 2, 3 or 4, according to which quintile it is in (this could be more general for any quantile, like quartile, decile etc.).
This is what I'm doing. There has to be something more elegant, no?
from scipy.stats import percentileofscore
n_quantiles = 5
def get_quantile(i, a, n_quantiles):
    if a[i] >= max(a):
        return n_quantiles - 1
    return int(percentileofscore(a, a[i]) / (100 / n_quantiles))
a_recoded = np.array([get_quantile(i, a, n_quantiles) for i in range(len(a))])
print(a)
print(a_recoded)
[0.04708996 0.86267278 0.23873192 0.02967989 0.42828385 0.58003015
0.8996666 0.15359369 0.83094778 0.44272398 0.60211289 0.90286434
0.40681163 0.91338397 0.3273745 0.00347029 0.37471307 0.72735901
0.93974808 0.55937197 0.39297097 0.91470761 0.76796271 0.50404401
0.1817242 0.78244809 0.9548256 0.78097562 0.90934337 0.89914752
0.82899983 0.44116683 0.50885813 0.2691431 0.11676798 0.84971927
0.38505195 0.7411976 0.51377242 0.50243197 0.89677377 0.69741088
0.47880953 0.71116534 0.01717348 0.77641096 0.88127268 0.17925502
0.53053573 0.16935597 0.65521692 0.19042794 0.21981197 0.01377195
0.61553814 0.8544525 0.53521604 0.88391848 0.36010949 0.35964882
0.29721931 0.71257335 0.26350287 0.22821314 0.8951419 0.38416004
0.19277649 0.67774468 0.27084229 0.46862229 0.3107887 0.28511048
0.32682302 0.14682896 0.10794566 0.58668243 0.16394183 0.88296862
0.55442047 0.25508233 0.86670299 0.90549872 0.04897676 0.33042884
0.4348465 0.62636481 0.48201213 0.49895892 0.36444648 0.01410316
0.46770595 0.09498391 0.96793139 0.03931124 0.64286295 0.50934846
0.59088907 0.56368594 0.7820928 0.77172038]
[0 4 1 0 2 3 4 0 4 2 3 4 2 4 1 0 1 3 4 2 1 4 3 2 0 3 4 3 4 4 4 2 2 1 0 4 1
3 2 2 4 3 2 3 0 3 4 0 2 0 3 0 1 0 3 4 2 4 1 1 1 3 1 1 4 1 0 3 1 2 1 1 1 0
0 3 0 4 2 1 4 4 0 1 2 3 2 2 1 0 2 0 4 0 3 2 3 2 3 3]
Update: just wanted to say this is so easy in R:
How to get the x which belongs to a quintile?
You could use argpartition. Example:
>>> a = np.random.random(20)
>>> N = len(a)
>>> nq = 5
>>> o = a.argpartition(np.arange(1, nq) * N // nq)
>>> out = np.empty(N, int)
>>> out[o] = np.arange(N) * nq // N
>>> a
array([0.61238649, 0.37168998, 0.4624829 , 0.28554766, 0.00098016,
0.41979328, 0.62275886, 0.4254548 , 0.20380679, 0.762435 ,
0.54054873, 0.68419986, 0.3424479 , 0.54971072, 0.06929464,
0.51059431, 0.68448674, 0.97009023, 0.16780152, 0.17887862])
>>> out
array([3, 1, 2, 1, 0, 2, 3, 2, 1, 4, 3, 4, 1, 3, 0, 2, 4, 4, 0, 0])
Here's one way to do it using pd.cut()
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(100))
df.columns = ['values']
# Apply the quantiles
gdf = df.groupby(pd.cut(df.loc[:, 'values'], np.arange(0, 1.2, 0.2)))['values'].apply(lambda x: list(x)).to_frame()
# Make use of the automatic indexing to assign quantile numbers
gdf.reset_index(drop=True, inplace=True)
# Re-expand the grouped list of values. Method provided by @Zero at https://stackoverflow.com/questions/32468402/how-to-explode-a-list-inside-a-dataframe-cell-into-separate-rows
gdf['values'].apply(pd.Series).stack().reset_index(level=1, drop=True).to_frame('values').reset_index()
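If the goal is just the quantile bin of each element, pandas also ships pd.qcut, which bins by quantile boundaries (unlike the fixed-width bins of pd.cut) and is probably the shortest route:

```python
import numpy as np
import pandas as pd

a = np.random.rand(100)
# labels=False returns the bin number (0..4 for quintiles) rather than an Interval.
a_recoded = pd.qcut(a, 5, labels=False)
print(np.bincount(a_recoded))  # [20 20 20 20 20]
```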

Python Pandas: Subsetting data frame both by rows and columns?

Data frame has w (week) and y (year) columns.
d = {
    'y': [11, 11, 13, 15, 15],
    'w': [5, 4, 7, 7, 8],
    'z': [1, 2, 3, 4, 5]
}
df = pd.DataFrame(d)
In [61]: df
Out[61]:
w y z
0 5 11 1
1 4 11 2
2 7 13 3
3 7 15 4
4 8 15 5
Two questions:
1) How to get from this data frame min/max date as two numbers w and y in a list [w,y] ?
2) How to subset both columns and rows, so all w and y in the resulting data frame are constrained by conditions:
11 <= y <= 15
4 <= w <= 7
To get min/max pairs I need functions:
min_pair() --> [11,4]
max_pair() --> [15,8]
and these to get a data frame subset:
from_to(y1,w1,y2,w2)
from_to(11,4,15,7) -->
should return rf data frame like this:
r = {
    'y': [11, 13, 15],
    'w': [4, 7, 7],
    'z': [2, 3, 4]
}
rf = pd.DataFrame(r)
In [62]: rf
Out[62]:
w y z
0 4 11 2
1 7 13 3
2 7 15 4
Are there any standard functions for this?
Update
For subsetting the following worked for me:
df[(df.y <= 15 ) & (df.y >= 11) & (df.w >= 4) & (df.w <= 7)]
a lot of typing though ...
Here are a couple of methods
In [176]: df[['w', 'y']].min().tolist()
Out[176]: [4, 11]
In [177]: df[['w', 'y']].max().tolist()
Out[177]: [8, 15]
In [178]: df.query('11 <= y <= 15 and 4 <= w <= 7')
Out[178]:
   w   y  z
0  5  11  1
1  4  11  2
2  7  13  3
3  7  15  4
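The helper functions asked for in the question can be wrapped around the same calls (a sketch; query resolves the @-prefixed names from the function's local scope):

```python
import pandas as pd

df = pd.DataFrame({'y': [11, 11, 13, 15, 15],
                   'w': [5, 4, 7, 7, 8],
                   'z': [1, 2, 3, 4, 5]})

def min_pair(df):
    # Smallest year and smallest week, as in the expected [11, 4].
    return [int(df['y'].min()), int(df['w'].min())]

def max_pair(df):
    return [int(df['y'].max()), int(df['w'].max())]

def from_to(df, y1, w1, y2, w2):
    # query resolves @-prefixed names from this function's locals.
    return df.query('@y1 <= y <= @y2 and @w1 <= w <= @w2')

print(min_pair(df))  # [11, 4]
print(max_pair(df))  # [15, 8]
print(from_to(df, 11, 4, 15, 7))
```

Like the df[...] expression in the update, from_to filters each column independently, so row 0 (y=11, w=5) is also kept; treating (y, w) as a lexicographic year-week range, as the three-row example implies, would need a different condition.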

Inserting counter objects into Dataframe python

I am confused about inserting a Counter (from collections) into a dataframe:
My dataframe looks like,
doc_cluster_key_freq=pd.DataFrame(index=[], columns=['doc_parent_id','keyword_id','key_count_in_doc_cluster'])
sim_docs_ids=[3342,3783]
The counters generated for the sim_docs_ids are given below:
id=3342
Counter({133: 9, 79749: 7})
id=3783
Counter({133: 10, 12072: 5, 79749: 1})
A counter is generated in a loop for each sim_docs_id.
My code looks like:
for doc_ids in sim_docs_ids:
    # generate counter for doc_ids
    # insert the counter into the dataframe (doc_cluster_key_freq) here
The output I am looking for is as below:
doc_cluster_key_freq=
doc_parent_id Keyword_id key_count_in_doc_cluster
0 3342 133 9
1 3342 79749 7
2 3783 133 10
3 3783 12072 5
4 3783 79749 1
I tried using counter.keys() and counter.values(), but I get something like the output below, and I have no idea how to separate the lists into different rows:
doc_parent_id Keyword_id key_count_in_doc_cluster
0 3342 [133, 79749] [9, 7]
1 3783 [12072, 133, 79749] [5, 10, 1]
If you have the same number of keywords for each doc_id, you may pre-allocate the proper row number for each record, and use the code below to ensure one row for each keyword in every doc_id:
keywords = ['key1', 'key2', 'key3', ...]
number_of_keywords = len(keywords)
for i, doc_id in enumerate(sim_doc_ids):
    # Generate keyword Counter (counter) for doc_id
    for j, key in enumerate(keywords):
        doc_cluster_key_freq.loc[i * number_of_keywords + j] = [doc_id, key, counter[key]]
An example:
import random
from collections import Counter
import pandas as pd
a = pd.DataFrame(columns=['id', 'keyword', 'count'])
keywords = ['a', 'b', 'c']
N = len(keywords)
ids = range(5)
for i, idd in enumerate(ids):
    counter = Counter({'a': random.randint(0, 10),
                       'b': random.randint(0, 10),
                       'c': random.randint(0, 10)})
    for j, key in enumerate(keywords):
        a.loc[i * N + j] = [idd, key, counter[key]]
Output:
id keyword count
0 0 a 10
1 0 b 9
2 0 c 9
3 1 a 1
4 1 b 10
5 1 c 10
6 2 a 9
7 2 b 0
8 2 c 5
9 3 a 6
10 3 b 0
11 3 c 8
12 4 a 0
13 4 b 3
14 4 c 8
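When each doc_id can have a different number of keywords (as in the question, where 3342 has two and 3783 has three), building a list of rows and constructing the frame once avoids pre-allocation entirely. A sketch using the counters from the question:

```python
from collections import Counter
import pandas as pd

# Counters from the question, keyed by doc id.
counters = {3342: Counter({133: 9, 79749: 7}),
            3783: Counter({133: 10, 12072: 5, 79749: 1})}

# One (doc_id, keyword, count) row per keyword, however many each doc has.
rows = [(doc_id, kw, cnt)
        for doc_id, counter in counters.items()
        for kw, cnt in counter.items()]
doc_cluster_key_freq = pd.DataFrame(
    rows, columns=['doc_parent_id', 'keyword_id', 'key_count_in_doc_cluster'])
print(doc_cluster_key_freq)
```

This produces one row per (doc, keyword) pair, matching the desired layout, and the single DataFrame constructor is also much faster than growing a frame with .loc in a loop.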
