Python: How to create weighted quantiles in Pandas?

I understand how to create simple quantiles in Pandas using pd.qcut. But after searching around, I don't see anything to create weighted quantiles. Specifically, I wish to create a variable which bins the values of a variable of interest (from smallest to largest) such that each bin contains an equal weight. So far this is what I have:
def wtdQuantile(dataframe, var, weight=None, n=10):
    if weight is None:
        return pd.qcut(dataframe[var], n, labels=False)
    else:
        dataframe.sort_values(var, ascending=True, inplace=True)
        cum_sum = dataframe[weight].cumsum()
        cutoff = max(cum_sum) / n
        quantile = cum_sum / cutoff
        quantile[-1:] -= 1
        return quantile.map(int)
Is there an easier way, or something prebuilt from Pandas that I'm missing?
Edit: As requested, I'm providing some sample data. In the following, I'm trying to bin the "Var" variable using "Weight" as the weight. Using pd.qcut, we get an equal number of observations in each bin. Instead, I want an equal weight in each bin, or in this case, as close to equal as possible.
Weight  Var  pd.qcut(n=5)  Desired_Rslt
10      1    0             0
14      2    0             0
18      3    1             0
15      4    1             1
30      5    2             1
12      6    2             2
20      7    3             2
25      8    3             3
29      9    4             3
45      10   4             4

I don't think this is built-in to Pandas, but here is a function that does what you want in a few lines:
import numpy as np
import pandas as pd
from pandas._libs.lib import is_integer

def weighted_qcut(values, weights, q, **kwargs):
    'Return weighted quantile cuts from a given series, values.'
    if is_integer(q):
        quantiles = np.linspace(0, 1, q + 1)
    else:
        quantiles = q
    order = weights.iloc[values.argsort()].cumsum()
    bins = pd.cut(order / order.iloc[-1], quantiles, **kwargs)
    return bins.sort_index()
We can test it on your data this way:
data = pd.DataFrame({
    'var': range(1, 11),
    'weight': [10, 14, 18, 15, 30, 12, 20, 25, 29, 45]
})
data['qcut'] = pd.qcut(data['var'], 5, labels=False)
data['weighted_qcut'] = weighted_qcut(data['var'], data['weight'], 5, labels=False)
print(data)
The output matches your desired result from above:
   var  weight  qcut  weighted_qcut
0    1      10     0              0
1    2      14     0              0
2    3      18     1              0
3    4      15     1              1
4    5      30     2              1
5    6      12     2              2
6    7      20     3              2
7    8      25     3              3
8    9      29     4              3
9   10      45     4              4
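If you would rather avoid the private pandas._libs.lib import, an isinstance check covers the common case of an integer q. A minimal variation of the function above, under that assumption:
def weighted_qcut_no_private_import(values, weights, q, **kwargs):
    'Same idea as weighted_qcut above, without the private pandas import.'
    # treat plain Python ints and numpy integers as a number of quantiles
    if isinstance(q, (int, np.integer)):
        quantiles = np.linspace(0, 1, q + 1)
    else:
        quantiles = q
    order = weights.iloc[values.argsort()].cumsum()
    bins = pd.cut(order / order.iloc[-1], quantiles, **kwargs)
    return bins.sort_index()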

Find local maxima or peaks (index) in a numeric series using numpy and pandas. A peak refers to a value surrounded by smaller values on both sides.

Write a Python program to find all the local maxima or peaks (indices) in a numeric series using numpy and pandas. A peak refers to a value surrounded by smaller values on both sides.
Note
Create a Pandas series from the given input.
Input format:
The first line of the input consists of a list of integers separated by spaces to form the pandas series.
Output format:
The output displays the array of indices where peak values are present.
Sample test case
input1
12 1 2 1 9 10 2 5 7 8 9 -9 10 5 15
output1
[2 5 10 12]
How to solve this problem?
import pandas as pd

a = "12 1 2 1 9 10 2 5 7 8 9 -9 10 5 15"
a = [int(x) for x in a.split(" ")]
angles = []
for i in range(len(a)):
    if i != 0:
        if a[i] > a[i-1]:
            angles.append('rise')
        else:
            angles.append('fall')
    else:
        angles.append('ignore')

prev = 0
prev_val = "none"
counts = []
for s in angles:
    if s == "fall" and prev_val == "rise":
        prev_val = s
        counts.append(1)
    else:
        prev_val = s
        counts.append(0)

peaks_pd = pd.Series(counts).shift(-1).fillna(0).astype(int)
df = pd.DataFrame({
    'a': a,
    'peaks': peaks_pd
})
peak_vals = list(df[df['peaks'] == 1]['a'].index)
This could be improved further. The steps I have followed:
First, find the direction at each point, i.e. whether the series is rising or falling.
Then take the indices at which it starts falling after rising and call those the peaks.
Use:
data = [12, 1, 2, 1.1, 9, 10, 2.1, 5, 7, 8, 9.1, -9, 10.1, 5.1, 15]
s = pd.Series(data)
n = 3  # number of points to be checked before and after
import numpy as np
from scipy.signal import argrelextrema
local_max_index = argrelextrema(s.to_frame().to_numpy(), np.greater_equal, order=n)[0].tolist()
print (local_max_index)
[0, 5, 14]
local_max_index = s.index[(s.shift() <= s) & (s.shift(-1) <= s)].tolist()
print (local_max_index)
[2, 5, 10, 12]
local_max_index = s.index[s == s.rolling(n, center=True).max()].tolist()
print (local_max_index)
[2, 5, 10, 12]
EDIT: Solution for processing the values in a DataFrame:
df = pd.DataFrame({'Input': ["12 1 2 1 9 10 2 5 7 8 9 -9 10 5 15"]})
print (df)
Input
0 12 1 2 1 9 10 2 5 7 8 9 -9 10 5 15
s = df['Input'].iloc[[0]].str.split().explode().astype(int).reset_index(drop=True)
print (s)
0 12
1 1
2 2
3 1
4 9
5 10
6 2
7 5
8 7
9 8
10 9
11 -9
12 10
13 5
14 15
Name: Input, dtype: int32
local_max_index = s.index[(s.shift() <= s) & (s.shift(-1) <= s)].tolist()
print (local_max_index)
[2, 5, 10, 12]
df['output'] = [local_max_index]
print (df)
Input output
0 12 1 2 1 9 10 2 5 7 8 9 -9 10 5 15 [2, 5, 10, 12]
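If you need the exact input/output format the exercise asks for, a minimal sketch that reads the space-separated integers from standard input and prints the peak indices, reusing the shift-based comparison from the answer above:
import pandas as pd

line = input()                      # e.g. "12 1 2 1 9 10 2 5 7 8 9 -9 10 5 15"
s = pd.Series([int(x) for x in line.split()])
# a peak is greater than or equal to both neighbours; the first and last
# positions are excluded automatically because shift() produces NaN there
peaks = s.index[(s.shift() <= s) & (s.shift(-1) <= s)].to_numpy()
print(peaks)                        # [ 2  5 10 12]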

Shuffling an array except the first and the last element in Python

I am generating a normal distribution but keeping the mean and std exactly the same by using np.random.seed(0). I am trying to shuffle r except for the first and the last elements of the array, but it keeps the remaining elements at the same locations in the array, as shown in the current output. I also present the expected output.
import numpy as np
np.random.seed(0)
mu, sigma = 50, 2.0 # mean and standard deviation
Nodes=10
r = np.random.normal(mu, sigma, Nodes)
sort_r = np.sort(r);
r1=sort_r[::-1]
r1=r1.reshape(1,Nodes)
r2 = r.copy()
np.random.shuffle(r2.ravel()[1:])
r2=r2.reshape(1,Nodes) #actual radius values in mu(m)
maximum = r2.max()
indice1 = np.where(r2 == maximum)
r2[indice1] = r2[0][0]
r2[0][0] = maximum
r2[0][Nodes-1] = maximum #+0.01*maximum
print("r2 with max at (0,0)=",[r2])
The current output for many runs is
r2 with max at (0,0)= [array([[54.4817864 , 51.90017684, 53.52810469, 53.73511598, 48.04544424,
51.95747597, 50.80031442, 50.821197 , 49.7935623 , 54.4817864 ]])]
The expected output is (shuffling all elements randomly except the first and the last element)
Run 1: r2 with max at (0,0)= [array([[54.4817864 , 53.52810469, 51.90017684, 53.73511598, 48.04544424, 49.7935623 , 50.80031442, 50.821197  , 51.95747597, 54.4817864 ]])]
Run 2: r2 with max at (0,0)= [array([[54.4817864 , 51.90017684, 53.52810469, 48.04544424, 53.73511598, 51.95747597, 49.7935623 , 50.80031442, 50.821197  , 54.4817864 ]])]
It's not quite clear from your question what you include in a run.
If, as it seems, you are initializing the distribution and the seed every time, shuffling once will always give you the same result. It has to be like that: because the random state is fixed so that your random numbers are predictable, the shuffle operation will also return the same result every time.
Let me show you what I mean with some simpler code than yours:
# reinit distribution and seed at each run
for run in range(5):
    np.random.seed(0)
    a = np.random.randint(10, size=10)
    np.random.shuffle(a)
    print(f'{run}:{a}')
Which will print
0:[2 3 9 0 3 7 4 5 3 5]
1:[2 3 9 0 3 7 4 5 3 5]
2:[2 3 9 0 3 7 4 5 3 5]
3:[2 3 9 0 3 7 4 5 3 5]
4:[2 3 9 0 3 7 4 5 3 5]
What you want is to initialize your distribution once and shuffle it at each run:
# init distribution and just shuffle it at each run
np.random.seed(0)
a = np.random.randint(10, size=10)
for run in range(5):
    np.random.shuffle(a)
    print(f'{run}:{a}')
Which will print:
0:[2 3 9 0 3 7 4 5 3 5]
1:[9 0 3 4 2 5 7 3 3 5]
2:[2 0 3 3 3 5 7 5 4 9]
3:[5 3 5 3 0 2 7 4 9 3]
4:[3 9 3 2 5 7 3 4 0 5]
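Applied to the array from the question, that means seeding and generating r once, then shuffling only the interior slice on each run. A minimal sketch (the slice assignment is mine, not part of the original code):
import numpy as np

np.random.seed(0)
mu, sigma, Nodes = 50, 2.0, 10
r = np.random.normal(mu, sigma, Nodes)

for run in range(3):
    r2 = r.copy()
    # shuffle everything except the first and the last element
    r2[1:-1] = np.random.permutation(r2[1:-1])
    print(f'run {run}: {r2}')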

How can I evenly split up a pandas.DataFrame into n-groups? [duplicate]

This question already has answers here:
Split dataframe into relatively even chunks according to length
(2 answers)
Closed 1 year ago.
I need to perform n-fold (in my particular case, 5-fold) cross validation on a dataset that I've stored in a pandas.DataFrame. My current way seems to rearrange the row labels:
spreadsheet1 = pd.ExcelFile("Testing dataset.xlsx")
dataset = spreadsheet1.parse('Sheet1')
data = 5 * [pd.DataFrame()]
i = 0
while i < len(dataset):
    j = 0
    while j < 5 and i < len(dataset):
        data[j] = (data[j].append(dataset.iloc[i])).reset_index(drop=True)
        i += 1
        j += 1
How can I split my DataFrame efficiently/intelligently without tampering with the order of the columns?
Use np.array_split to break it up into a list of "evenly" sized DataFrames. You can shuffle too if you sample the full DataFrame first.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(24).reshape(-1,2), columns=['A', 'B'])
N = 5
np.array_split(df, N)
#np.array_split(df.sample(frac=1), N) # Shuffle and split
[ A B
0 0 1
1 2 3
2 4 5,
A B
3 6 7
4 8 9
5 10 11,
A B
6 12 13
7 14 15,
A B
8 16 17
9 18 19,
A B
10 20 21
11 22 23]
I am still not sure why you want to do it this way, but here is a solution:
df['fold'] = np.random.randint(1, 6, df.shape[0])
For example, your first fold is
df.loc[df['fold'] == 1]
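If the goal is the 5-fold cross validation mentioned in the question, here is a short usage sketch of that random fold column (note the folds are only approximately equal in size):
for k in range(1, 6):
    test = df[df['fold'] == k]     # hold fold k out
    train = df[df['fold'] != k]
    # fit and evaluate your model on train / test here
    print(k, len(train), len(test))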

Difficulty displaying histogram for every variable

I need to plot histograms for numeric variables, in order to determine if their distributions are skewed. Below is the function definition, and the function being called.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sys

def variable_type(df, nominal_level=3):
    categorical, numeric, nominal = [], [], []
    for variable in df.columns.values:
        # if the array variable is of type int or float
        if np.issubdtype(np.array(df[variable]).dtype, int) or np.issubdtype(np.array(df[variable]).dtype, float):
            if len(np.unique(np.array(df[variable]))) <= nominal_level:
                nominal.append(variable)
            else:
                numeric.append(variable)
        else:
            categorical.append(variable)
    return numeric, categorical, nominal

def draw_histograms(df, variables, n_rows, n_cols):
    fig = plt.figure()
    import math
    for i in range(min(n_rows * n_cols, len(variables))):
        index = n_rows * 100 + n_cols * 10 + i + 1
        ax = fig.add_subplot(index)
        df[variables[i]].hist(bins=20, ax=ax)
        plt.title(variables[i] + ' distribution')
        #plt.xlabel(variables[i])
        #plt.ylabel('Count')
    plt.show()

def main():
    df = read_data()
    col_names = df.columns.tolist()
    numeric, categorical, nominal = variable_type(df)
    util.draw_histograms(df, numeric, 3, 3)

if __name__ == "__main__":
    main()
My program only works when I use 3, 3 for n_rows and n_cols in the calling function, and that is a problem because it only plots 9 of the 20 variables. If I try any other numbers, I get ValueError: num must be 1 <= num <= 18, not 0, or some other range depending on my chosen n_rows and n_cols. What can I do to plot all 20 numeric variables as subplots on one figure? Or should I break it into different figures? This is a sample of my data frame:
TARGET_B ID GiftCnt36 GiftCntAll GiftCntCard36 GiftCntCardAll \
0 0 14974 2 4 1 3
1 0 6294 1 8 0 3
2 1 46110 6 41 3 20
3 1 185937 3 12 3 8
4 0 29637 1 1 1 1
GiftAvgLast GiftAvg36 GiftAvgAll GiftAvgCard36 ... \
0 17 13.50 9.25 17.00 ...
1 20 20.00 15.88 NaN ...
2 6 5.17 3.73 5.00 ...
3 10 8.67 8.50 8.67 ...
4 20 20.00 20.00 20.00 ...
PromCntCardAll StatusCat96NK StatusCatStarAll DemCluster DemAge \
0 13 A 0 0 NaN
1 24 A 0 23 67
2 22 S 1 0 NaN
3 16 E 1 0 NaN
4 6 F 0 35 53
DemGender DemHomeOwner DemMedHomeValue DemPctVeterans DemMedIncome
0 F U $0 0 $0
1 F U $186,800 85 $0
2 M U $87,600 36 $38,750
3 M U $139,200 27 $38,942
4 M U $168,100 37 $71,509
There is a NaN in your 10th attribute. Can your code handle this?
Can you plot the 10th attribute?
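The ValueError comes from the three-digit add_subplot shorthand: n_rows * 100 + n_cols * 10 + i + 1 only encodes single-digit panel positions, so it breaks once you need more than nine panels. A minimal alternative sketch (not part of the original code) using plt.subplots, which works for any grid size such as 4 x 5 for 20 variables:
import numpy as np
import matplotlib.pyplot as plt

def draw_histograms(df, variables, n_rows, n_cols):
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(4 * n_cols, 3 * n_rows))
    axes = np.atleast_1d(axes).ravel()              # flat list of axes
    for ax, var in zip(axes, variables):
        df[var].dropna().hist(bins=20, ax=ax)       # NaNs (e.g. DemAge) are skipped
        ax.set_title(var + ' distribution')
    for ax in axes[len(variables):]:                # hide any unused panels
        ax.set_visible(False)
    fig.tight_layout()
    plt.show()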

Binning values into groups with a minimum size using pandas

I'm trying to bin a sample of observations into n discrete groups, then combine these groups until each subgroup has a minimum of 6 members. So far, I've generated bins and grouped my DataFrame into them:
# df is a DataFrame containing 135 measurements
bins = np.linspace(df.heights.min(), df.heights.max(), 21)
grp = df.groupby(np.digitize(df.heights, bins))
grp.size()
1 4
2 1
3 2
4 3
5 2
6 8
7 7
8 6
9 19
10 12
11 13
12 12
13 7
14 12
15 12
16 2
17 3
18 6
19 3
21 1
So I can see that I need to combine groups 1 - 3, 3 - 5, and 16 - 21, while leaving the others intact, but I don't know how to do this programmatically.
You can do this:
df = pd.DataFrame(np.random.random_integers(1, 200, 135), columns=['heights'])
bins = np.linspace(df.heights.min(), df.heights.max(), 21)
grp = df.groupby(np.digitize(df.heights, bins))
sizes = grp.size()

def f(vals, max):
    sum = 0
    group = 1
    for v in vals:
        sum += v
        if sum <= max:
            yield group
        else:
            group += 1
            sum = v
            yield group

# I've changed 6 to 30 for the example because I don't have your original dataset
grp.size().groupby([g for g in f(sizes, 30)])
And if you print grp.size().groupby([g for g in f(sizes, 30)]).cumsum() you will see that the cumulative sums are grouped as expected.
Also, if you want to group the original values, you can do something like:
dat = np.random.random_integers(0, 200, 135)
dat = np.array([78,116,146,111,147,78,14,91,196,92,163,144,107,182,58,89,77,134,
                83,126,94,70,121,175,174,88,90,42,93,131,91,175,135,8,142,166,
                1,112,25,34,119,13,95,182,178,200,97,8,60,189,49,94,191,81,
                56,131,30,107,16,48,58,65,78,8,0,11,45,179,151,130,35,64,
                143,33,49,25,139,20,53,55,20,3,63,119,153,14,81,93,62,162,
                46,29,84,4,186,66,90,174,55,48,172,83,173,167,66,4,197,175,
                184,20,23,161,70,153,173,127,51,186,114,27,177,96,93,105,169,158,
                83,155,161,29,197,143,122,72,60])
df = pd.DataFrame({'heights': dat})
bins = np.digitize(dat, np.linspace(0, 200, 21))
grp = df.heights.groupby(bins)

m = 15  # you should put 6 here, the minimum
s = 0
c = 1
def f(x):
    global c, s
    res = pd.Series([c] * x.size, index=x.index)
    s += x.size
    if s > m:
        s = 0
        c += 1
    return res

g = grp.apply(f)
print(df.groupby(g).size())

# another way of doing the same, just a matter of taste
m = 15  # you should put 6 here, the minimum
s = 0
c = 1
def f2(x):
    global c, s
    res = [c] * x.size  # here is the main difference with f
    s += x.size
    if s > m:
        s = 0
        c += 1
    return res

g = grp.transform(f2)  # call it this way
print(df.groupby(g).size())
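For the original requirement of a minimum of 6 members per group, here is a different sketch (mine, not part of the answer above) that relabels the digitized bins directly: it accumulates consecutive bins until a merged group reaches the minimum and folds a short trailing group into the previous one.
def merge_small_bins(bin_labels, min_size=6):
    labels = pd.Series(bin_labels)
    sizes = labels.value_counts().sort_index()   # members per original bin
    mapping, group, count = {}, 0, 0
    for b, n in sizes.items():
        mapping[b] = group
        count += n
        if count >= min_size:        # group is big enough, start a new one
            group += 1
            count = 0
    if count and group > 0:          # trailing group too small: merge backwards
        for b in mapping:
            if mapping[b] == group:
                mapping[b] = group - 1
    return labels.map(mapping)

edges = np.linspace(df.heights.min(), df.heights.max(), 21)
merged = merge_small_bins(np.digitize(df.heights, edges), min_size=6)
print(df.groupby(merged.values).size())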
