Python Pandas - Column - All possible combinations

I have a dataset with two columns (Name and Score) and I would like to split the names into two groups: group 1 and group 2. For each split I then calculate sum(score group 1) / sum(score group 2). My goal is to find, over all possible assignments of names to groups, the smallest sum(score group 1) / sum(score group 2).
df = pd.DataFrame({
    'Name': list('ABCDEF'),
    'Score': [600, 1000, 300, 100, -100, 3000],
}, columns=['Name', 'Score'])
df
Name Score
0 A 600
1 B 1000
2 C 300
3 D 100
4 E -100
5 F 3000
Example of the first iteration:
Group Name Score
0 1 A 600
1 2 B 1000
2 2 C 300
3 2 D 100
4 2 E -100
5 2 F 3000
sum(score group 1) / sum(score group 2) = 0.1395
Example of the second iteration:
Group Name Score
0 1 A 600
1 1 B 1000
2 2 C 300
3 2 D 100
4 2 E -100
5 2 F 3000
sum(score group 1) / sum(score group 2) = 0.4848
And then, calculate score for all combinations and get the smallest sum(score group 1) / sum(score group 2)

I've updated my solution so it is in line with your examples.
Essentially, all possible combinations of your groups can be generated from the binary representations of the numbers in the range 1 to 2**len(df.index) - 1 (the upper bound is excluded so that group 2 is never empty).
You then convert each binary representation to a list of bools (comb_bools) so it can be used to index your dataframe - group 1 is selected by comb_bools and group 2 by its element-wise negation.
Once you have these lists you can easily calculate the values you require and store them in the list result.
import pandas as pd
import numpy as np

df = pd.DataFrame({'Name': list('ABCDEF'),
                   'Score': [600, 1000, 300, 100, -100, 3000]},
                  columns=['Name', 'Score'])

nb_combs = 2**len(df.index) - 1
group1 = []
group2 = []
result = []
for i in range(1, nb_combs):
    # one binary digit per row: 1 -> group 1, 0 -> group 2
    comb = list(map(int, format(i, '0' + str(len(df.index)) + 'b')))
    comb_bools = list(map(bool, comb))
    group1.append(df[comb_bools]['Name'].values)
    group2.append(df[[not j for j in comb_bools]]['Name'].values)
    numerator = sum(df[df['Name'].isin(group1[i - 1])]['Score'].values)
    denominator = sum(df[df['Name'].isin(group2[i - 1])]['Score'].values)
    result.append(numerator / denominator)

min_idx = np.argmin(result)
print('Minimum value: {}'.format(result[min_idx]))
print('Corresponding Group1: {}'.format(group1[min_idx]))
print('Corresponding Group2: {}\n'.format(group2[min_idx]))
Output:
Minimum value: -50.0
Corresponding Group1: ['A' 'B' 'C' 'D' 'F']
Corresponding Group2: ['E']
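For reference, the same enumeration can be written more compactly with itertools.combinations. This is a sketch under the same assumptions (both groups non-empty, splits where group 2 sums to zero are skipped), not part of the original answer:

from itertools import combinations

names = df['Name'].tolist()
total = df['Score'].sum()
best = None
for r in range(1, len(names)):          # group 1 takes r names; both groups stay non-empty
    for g1 in combinations(names, r):
        s1 = df.loc[df['Name'].isin(g1), 'Score'].sum()
        s2 = total - s1
        if s2 != 0:                     # skip splits where group 2 sums to zero
            ratio = s1 / s2
            if best is None or ratio < best[0]:
                best = (ratio, g1)
print(best)  # (-50.0, ('A', 'B', 'C', 'D', 'F'))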


Labeling whether the numbers in a dataframe is going up first or down first

Let's label a dataframe with two columns, A and B, and 100M rows. Starting at index i, we want to know whether the data in column B is trending down or trending up compared with the value at [i, 'A'].
Here is a loop:
import pandas as pd

df = pd.DataFrame({'A': [0,1,2,3,5,0,0,0,0,0], 'B': [1, 10, -10, 2, 3,0,0,0,0,0], "label": [0,0,0,0,0,0,0,0,0,0]})
for i in range(0, 5):
    j = i
    while j in range(i, i + 5) and df.at[i, 'label'] == 0:  # if classified, no need to continue
        if df.at[j, 'B'] - df.at[i, 'A'] >= 10:
            df.at[i, 'label'] = 1  # label 1 means trending up
        if df.at[j, 'B'] - df.at[i, 'A'] <= -10:
            df.at[i, 'label'] = 2  # label 2 means trending down
        j = j + 1
[out]
   A    B  label
0  0    1      1
1  1   10      2
2  2  -10      2
3  3    2      0
4  5    3      0
...
The estimated finishing time for this code is 30 days. (A human with a plot and a ruler might finish this task faster.)
What is a fast way to do this? Ideally without a loop.
Looping over a DataFrame is slow compared to using Pandas methods.
The task can be accomplished using Pandas vectorized methods:
rolling, which does computations over a rolling window
min & max, which we compute within the rolling window
numpy.select, which allows us to set values based upon conditional logic
Code
import numpy as np
import pandas as pd

def set_trend(df, threshold=10, window_size=2):
    '''
    Use a rolling window to find max/min values in a window starting at the current point.
    A rolling window normally looks at backward values; we use the technique from
    https://stackoverflow.com/questions/22820292/how-to-use-pandas-rolling-functions-on-a-forward-looking-basis/22820689#22820689
    to look at forward values.
    '''
    # To get a rolling window over lookahead values in column B,
    # we reverse the values in column B
    df['B_rev'] = df["B"].values[::-1]

    # Max & min in B_rev, then reverse the order of these max/min
    # https://stackoverflow.com/questions/50837012/pandas-rolling-min-max
    df['max_'] = df.B_rev.rolling(window_size, min_periods=0).max().values[::-1]
    df['min_'] = df.B_rev.rolling(window_size, min_periods=0).min().values[::-1]

    nrows = df.shape[0] - 1  # adjustment for argmax & argmin indexes since rows are in reverse order
    # i.e. idx = nrows - x.argmax() gives the index of the max in the non-reversed rows
    df['max_idx'] = df.B_rev.rolling(window_size, min_periods=0).apply(lambda x: nrows - x.argmax(), raw=True).values[::-1]
    df['min_idx'] = df.B_rev.rolling(window_size, min_periods=0).apply(lambda x: nrows - x.argmin(), raw=True).values[::-1]

    # Use np.select to implement the label assignment logic
    conditions = [
        (df['max_'] - df["A"] >= threshold) & (df['max_idx'] <= df['min_idx']),   # max above & comes first
        (df['min_'] - df["A"] <= -threshold) & (df['min_idx'] <= df['max_idx']),  # min below & comes first
        df['max_'] - df["A"] >= threshold,   # max above threshold but didn't come first
        df['min_'] - df["A"] <= -threshold,  # min below threshold but didn't come first
    ]
    choices = [
        1,  # max above & came first
        2,  # min below & came first
        1,  # max above threshold
        2,  # min below threshold
    ]
    df['label'] = np.select(conditions, choices, default=0)

    # Drop scratch computation columns
    df.drop(['B_rev', 'max_', 'min_', 'max_idx', 'min_idx'], axis=1, inplace=True)
    return df
Tests
Case 1
df = pd.DataFrame({'A': [0,1,2,3,5,0,0,0,0,0], 'B': [1, 10, -10, 2, 3,0,0,0,0,0], "label":[0,0,0,0,0,0,0,0,0,0]})
display(set_trend(df, 10, 4))
Case 2
df = pd.DataFrame({'A': [0,1,2], 'B': [1, -10, 10]})
display(set_trend(df, 10, 4))
Output
Case 1
A B label
0 0 1 1
1 1 10 2
2 2 -10 2
3 3 2 0
4 5 3 0
5 0 0 0
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 0
Case 2
A B label
0 0 1 2
1 1 -10 2
2 2 10 0
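As a side note, on pandas 1.0+ the reversal trick can be avoided entirely: pandas.api.indexers.FixedForwardWindowIndexer builds a forward-looking rolling window directly. A minimal sketch of that variant (my addition, not part of the answer above):

from pandas.api.indexers import FixedForwardWindowIndexer

# forward-looking window of size 4 over column B, no reversal needed
indexer = FixedForwardWindowIndexer(window_size=4)
fwd_max = df['B'].rolling(indexer, min_periods=1).max()
fwd_min = df['B'].rolling(indexer, min_periods=1).min()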

More than one condition met with numpy select

I have the following dataframe:
A B C D E F
100 0 0 0 100 0
0 100 0 0 0 100
-100 0 0 0 100 0
and this code:
cond = [
    (df['A'] == 100),
    (df['A'] == -100),
    (df['B'] == 100),
    (df['C'] == 100),
    (df['D'] == 100),
    (df['E'] == 100),
    (df['F'] == 100),
]
choices = ['A', 'neg_A', 'B', 'C', 'D', 'E', 'F']
df['result'] = np.select(cond, choices)
Each row matches two conditions, but I want only one result to be selected. I want the selection to be made with these criteria:
+A = 67%
-A = 68%
B = 70%
C = 75%
D = 66%
E = 54%
F = 98%
The percentage shows the accuracy rate, so I want the condition with the highest percentage to be preferred over the other.
Intended result:
A B C D E F result
100 0 0 0 100 0 A
0 100 0 0 0 100 F
-100 0 0 0 100 0 neg_A
A little help will be appreciated. Thanks!
EDIT:
Some of the columns (like A) may have a mix of 100 and -100. Positive 100 will yield a simple A (see row 1) but a -100 should yield some other name like "neg_A" in the result (see row 3).
Let's sort the columns of the dataframe based on the priority values, then use .eq + .idxmax on axis=1 to get the column name of the first occurrence of 100:
# define a dict with col names and priority values
d = {'A': .67, 'B': .70, 'C': .75, 'D': .66, 'E': .54, 'F': .98}
df['result'] = df[sorted(d, key=lambda x: -d[x])].eq(100).idxmax(axis=1)
A B C D E F result
0 100 0 0 0 100 0 A
1 0 100 0 0 0 100 F
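The snippet above does not cover the neg_A case from the EDIT. One possible extension, sketched under the assumption that only column A can contain -100: build a hit mask, weight each hit by its priority, and take the best column per row.

import numpy as np
import pandas as pd

d = {'A': .67, 'neg_A': .68, 'B': .70, 'C': .75, 'D': .66, 'E': .54, 'F': .98}

cols = list('ABCDEF')
hits = df[cols].eq(100)                       # True wherever a plain 100 occurs
hits['neg_A'] = df['A'].eq(-100)              # -100 in A counts as 'neg_A'
weights = hits.astype(float) * pd.Series(d)   # zero out misses, weight hits by priority
df['result'] = weights.idxmax(axis=1)         # column with the highest-priority hit
# note: a row with no hits at all would fall back to the first column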

How to compare columns and return values recursively?

I have a data frame that looks like this:
data_dict = {'factor_1' : np.random.randint(1, 5, 10), 'factor_2' : np.random.randint(1, 5, 10), 'multi' : np.random.rand(10), 'output' : np.NaN}
df = pd.DataFrame(data_dict)
I'm getting stuck implementing this comparison:
If the factor_1 and factor_2 values match, then output = 2 * multi (here 2 is a kind of base value). Continue scanning the next rows.
If the factor_1 and factor_2 values don't match, then:
output = -2. Scan the next row(s).
If the factor values still don't match up to row R, assign output values of -2^2, -2^3, ..., -2^R respectively.
When the factor values match at row R+1, assign output = 2^(R+1) * multi.
Repeat the process.
This solution does not use recursion:
# sample data
np.random.seed(1)
data_dict = {'factor_1' : np.random.randint(1, 5, 10), 'factor_2' : np.random.randint(1, 5, 10), 'multi' : np.random.rand(10), 'output' : np.NaN}
df = pd.DataFrame(data_dict)
# create a mask
mask = (df['factor_1'] != df['factor_2'])
# R = running count of consecutive mismatches, reset to 0 on a match
df['R'] = mask.cumsum() - mask.cumsum().where(~mask).ffill().fillna(0)
# use np.where to create the output
df['output'] = np.where(df['R'] == 0, df['multi']*2, -2**df['R'])
factor_1 factor_2 multi output R
0 2 1 0.419195 -2.000000 1.0
1 4 2 0.685220 -4.000000 2.0
2 1 1 0.204452 0.408904 0.0
3 1 4 0.878117 -2.000000 1.0
4 4 2 0.027388 -4.000000 2.0
5 2 1 0.670468 -8.000000 3.0
6 4 3 0.417305 -16.000000 4.0
7 2 2 0.558690 1.117380 0.0
8 4 3 0.140387 -2.000000 1.0
9 1 1 0.198101 0.396203 0.0
The solution I present is, maybe, a little bit harder to read, but I think it works as you wanted. It combines
numpy.where() in order to make a column based on a condition,
pandas.DataFrame.shift() and pandas.DataFrame.cumsum() to label different groups with consecutive similar values, and
pandas.DataFrame.rank() in order to construct a vector of powers applied to the previously made df['output'] column.
The code follows.
df['output'] = np.where(df.factor_1 == df.factor_2, -2 * df.multi, 2)  # -2*multi for matches (negated below), 2 for mismatches
group = ['output', (df.output != df.output.shift()).cumsum()]  # label consecutive runs of equal output values
df['output'] = (-1) * df.output.pow(df.groupby(group).output.rank('first'))  # rank within each run is the exponent; the -1 flips the sign
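To see the rank trick in isolation, here is a tiny check (my own illustration): within a run of three consecutive mismatches the intermediate outputs are all 2, their 'first' ranks within the run are 1, 2, 3, and negating 2**rank yields -2, -4, -8, i.e. -2^R.

import pandas as pd

s = pd.Series([2, 2, 2])                    # a run of three mismatches
run = (s != s.shift()).cumsum()             # one label for the whole run
ranks = s.groupby([s, run]).rank('first')   # 1.0, 2.0, 3.0
print(((-1) * s.pow(ranks)).tolist())       # [-2.0, -4.0, -8.0]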
Alternatively, the same logic as a plain Python loop over the input lists:
data_dict['output'] = [np.nan] * len(data_dict['multi'])  # 'output' must be a list here, not a scalar NaN
flag = False
cols = ('factor_1', 'factor_2', 'multi')
z = zip(*[data_dict[col] for col in cols])
for i, (f1, f2, multi) in enumerate(z):
    if f1 == f2:
        output = 2 * multi
        flag = False
    else:
        if flag:
            output *= 2
        else:
            output = -2
        flag = True
    data_dict['output'][i] = output
The tricky part is the flag variable, which tells you whether the previous row had a match or not.

Can we use a pandas data frame to calculate the next value using a previous value? A good example would be the Fibonacci numbers

So I understand we can use pandas data frame to do vector operations on cells like
df = pd.DataFrame([a, b, c])
df * 3
would equal something like:
0 a*3
1 b*3
2 c*3
but could we use a pandas dataframe to, say, calculate the Fibonacci sequence?
I am asking because for the Fibonacci sequence the next number depends on the previous two numbers (F(n) = F(n-1) + F(n-2)). I am not exactly interested in the Fibonacci sequence, and more interested in knowing whether we can do something like:
df = pd.DataFrame([a,b,c])
df.apply( some_func )
0 x1 a
1 x2 b
2 x3 c
where x1 would be calculated from a, b, c (I know this is possible), x2 would be calculated from x1, and x3 would be calculated from x2.
The Fibonacci example would just be something like:
df = pd.DataFrame()
df.apply(fib(n, df))
0 0
1 1
2 1
3 2
4 3
5 5
.
.
.
n-1 F(n-1) + F(n-2)
You need to iterate through the rows and access the previous rows' data via DataFrame.loc. For example, with n = 6:
df = pd.DataFrame()
for i in range(0, 6):
    df.loc[i, 'f'] = i if i in [0, 1] else df.loc[i - 1, 'f'] + df.loc[i - 2, 'f']
df
f
0 0.0
1 1.0
2 1.0
3 2.0
4 3.0
5 5.0
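Worth noting: growing a DataFrame cell by cell with .loc is slow for large n. A sketch of the usual workaround (my addition, not from the answer): compute the sequence in plain Python first, then wrap it in a DataFrame.

import pandas as pd

n = 6
fib = [0, 1]
for _ in range(n - 2):
    fib.append(fib[-1] + fib[-2])  # F(n) = F(n-1) + F(n-2)
df = pd.DataFrame({'f': fib[:n]})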

How to get location of nearest number in DataFrame

I have the following DataFrame:
0
0 5
1 10
2 15
3 20
I want to get the location of the value nearest to n. For example: if n = 7, the nearest number is 5,
and it should then return the location of 5, i.e. [0][0].
Use Series.abs and Series.idxmin:
# Setup
df = pd.DataFrame({0: {0: 5, 1: 10, 2: 15, 3: 20}})
n = 7
(n - df[0]).abs().idxmin()
[out]
0
Use numpy.argmin to get the index of the closest number:
df[1] = 7
df[2] = df[1] - df[0]
df[2] = df[2].abs()
print(np.argmin(df[2]))
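If the frame had several columns and you needed the (row, column) location rather than just the row index, a sketch with numpy (my own addition, re-creating the setup frame):

import numpy as np
import pandas as pd

df = pd.DataFrame({0: {0: 5, 1: 10, 2: 15, 3: 20}})
n = 7
arr = (df - n).abs().to_numpy()                     # distance of every cell from n
row, col = np.unravel_index(arr.argmin(), arr.shape)
print(row, col)  # 0 0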
