Pandas Group by a range - python

I have data like
{a: 100, b: 102, c: 500, d: 99, e: 78, f: 88}
I want to group it by ranges with an interval of 100.
Example:
{ 100: 2, 0: 3, 500:1 }
That is, in English:
2 occurrences of a number between 100..199
1 occurrence of a number between 500..599
3 occurrences of a number between 0..99
How do I express this in pandas?

IIUC, grouping by a range is usually a job for pd.cut:
import numpy as np
import pandas as pd

d = {'a': 100, 'b': 102, 'c': 500, 'd': 99, 'e': 78, 'f': 88}
bins = np.arange(0, 601, 100)
pd.cut(pd.Series(d), bins=bins, labels=bins[:-1], right=False).value_counts(sort=False)
Output:
0 3
100 2
200 0
300 0
400 0
500 1
dtype: int64
Update: actually, pd.cut seems like overkill here and your case is a bit simpler:
(pd.Series(d)//100).value_counts(sort=False)
Output:
0 3
1 2
5 1
dtype: int64
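If you want the exact dict from the question, with the range starts as keys, a small follow-up multiplies the bucket index back by 100; note value_counts gives no ordering guarantee with sort=False, so key order may vary:
import pandas as pd

d = {'a': 100, 'b': 102, 'c': 500, 'd': 99, 'e': 78, 'f': 88}
s = pd.Series(d)
# Multiply the bucket index back by 100 to recover the range start of each bucket
print((s // 100 * 100).value_counts(sort=False).to_dict())
# {0: 3, 100: 2, 500: 1}  (key order may vary)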

This solution derives the bins from the maximum value of the Series, passes all bin edges except the last (b[:-1]) as labels to cut, and then counts the values with GroupBy.size:
d = {'a' : 100, 'b':102, 'c':500, 'd':99, 'e':78, 'f':88}
s = pd.Series(d)
max1 = int(s.max() // 100 + 1) * 100
b = np.arange(0, max1 + 100, 100)
print (b)
[ 0 100 200 300 400 500 600]
d1 = s.groupby(pd.cut(s, bins=b, labels=b[:-1], right=False)).size().to_dict()
print (d1)
{0: 3, 100: 2, 200: 0, 300: 0, 400: 0, 500: 1}
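If the empty bins (200, 300, 400) aren't wanted in that dict, a one-line filter drops them; this is a minimal follow-up that reuses the d1 produced by the snippet above:
# d1 is the dict produced by the snippet above
d2 = {k: v for k, v in d1.items() if v != 0}
print(d2)
{0: 3, 100: 2, 500: 1}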

Related

Labeling whether the numbers in a dataframe are going up first or down first

Let's label a dataframe with two columns, A and B, and 100M rows. Starting at index i, we want to know whether the data in column B is trending down or up compared to the value at [i, 'A'].
Here is a loop:
import pandas as pd

df = pd.DataFrame({'A': [0,1,2,3,5,0,0,0,0,0], 'B': [1, 10, -10, 2, 3,0,0,0,0,0], "label": [0,0,0,0,0,0,0,0,0,0]})
for i in range(0, 5):
    j = i
    while j in range(i, i + 5) and df.at[i, 'label'] == 0:  # if classified, no need to continue
        if df.at[j, 'B'] - df.at[i, 'A'] >= 10:
            df.at[i, 'label'] = 1  # Label 1 means trending up
        if df.at[j, 'B'] - df.at[i, 'A'] <= -10:
            df.at[i, 'label'] = 2  # Label 2 means trending down
        j = j + 1
[out]
   A   B  label
0  0   1      1
1  1  10      2
2  2 -10      2
3  3   2      0
4  5   3      0
...
The estimated finishing time for this code is 30 days. (A human with a plot and a ruler might finish this task faster.)
What is a fast way to do this? Ideally without a loop.
Looping over a DataFrame is slow compared to using Pandas methods. The task can be accomplished with vectorized methods:
rolling, which does computations over a rolling window
min & max, which we compute within the rolling window
np.select, which sets values based on a list of boolean conditions
Code
import numpy as np
import pandas as pd

def set_trend(df, threshold=10, window_size=2):
    '''
    Use a rolling window to find max/min values in a window starting at the current point.
    A rolling window normally looks at backward values; we use the technique from
    https://stackoverflow.com/questions/22820292/how-to-use-pandas-rolling-functions-on-a-forward-looking-basis/22820689#22820689
    to look at forward values.
    '''
    # To get a rolling window over lookahead values in column B,
    # we reverse the values in column B
    df['B_rev'] = df["B"].values[::-1]

    # Max & min in B_rev, then reverse the order of these max/min
    # https://stackoverflow.com/questions/50837012/pandas-rolling-min-max
    df['max_'] = df.B_rev.rolling(window_size, min_periods=0).max().values[::-1]
    df['min_'] = df.B_rev.rolling(window_size, min_periods=0).min().values[::-1]

    # Adjustment for argmax & argmin indexes, since rows are in reverse order:
    # idx = nrows - x.argmax() gives the index of the max in the non-reversed rows
    nrows = df.shape[0] - 1
    df['max_idx'] = df.B_rev.rolling(window_size, min_periods=0).apply(lambda x: nrows - x.argmax(), raw=True).values[::-1]
    df['min_idx'] = df.B_rev.rolling(window_size, min_periods=0).apply(lambda x: nrows - x.argmin(), raw=True).values[::-1]

    # Use np.select to implement the label-assignment logic
    conditions = [
        (df['max_'] - df["A"] >= threshold) & (df['max_idx'] <= df['min_idx']),   # max above & comes first
        (df['min_'] - df["A"] <= -threshold) & (df['min_idx'] <= df['max_idx']),  # min below & comes first
        df['max_'] - df["A"] >= threshold,    # max above threshold but didn't come first
        df['min_'] - df["A"] <= -threshold,   # min below threshold but didn't come first
    ]
    choices = [
        1,  # max above & came first
        2,  # min below & came first
        1,  # max above threshold
        2,  # min below threshold
    ]
    df['label'] = np.select(conditions, choices, default=0)

    # Drop scratch computation columns
    df.drop(['B_rev', 'max_', 'min_', 'max_idx', 'min_idx'], axis=1, inplace=True)
    return df
Tests
Case 1
df = pd.DataFrame({'A': [0,1,2,3,5,0,0,0,0,0], 'B': [1, 10, -10, 2, 3,0,0,0,0,0], "label":[0,0,0,0,0,0,0,0,0,0]})
display(set_trend(df, 10, 4))
Case 2
df = pd.DataFrame({'A': [0,1,2], 'B': [1, -10, 10]})
display(set_trend(df, 10, 4))
Output
Case 1
A B label
0 0 1 1
1 1 10 2
2 2 -10 2
3 3 2 0
4 5 3 0
5 0 0 0
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 0
Case 2
A B label
0 0 1 2
1 1 -10 2
2 2 10 0
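The reverse-then-roll trick that set_trend relies on can be seen in isolation; here is a minimal sketch of a forward-looking rolling max over the question's B column:
import pandas as pd

s = pd.Series([1, 10, -10, 2, 3])
# Reverse, take a backward-looking rolling max, then reverse again:
# each position now holds the max over itself and the next value
fwd_max = s[::-1].rolling(2, min_periods=0).max()[::-1]
print(fwd_max.tolist())  # [10.0, 10.0, 2.0, 3.0, 3.0]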

More than one condition met - numpy select

I have the following dataframe:
A B C D E F
100 0 0 0 100 0
0 100 0 0 0 100
-100 0 0 0 100 0
and this code:
cond = [
    (df['A'] == 100),
    (df['A'] == -100),
    (df['B'] == 100),
    (df['C'] == 100),
    (df['D'] == 100),
    (df['E'] == 100),
    (df['F'] == 100),
]
choices = ['A', 'neg_A', 'B', 'C', 'D', 'E', 'F']
df['result'] = np.select(cond, choices)
For each row more than one condition is met, but I want only one result to be selected. I want the selection to be made with these criteria:
+A = 67%
-A = 68%
B = 70%
C = 75%
D = 66%
E = 54%
F = 98%
The percentage shows the accuracy rate, so I want the condition with the highest percentage to be preferred over the others.
Intended result:
A B C D E F result
100 0 0 0 100 0 A
0 100 0 0 0 100 F
-100 0 0 0 100 0 neg_A
A little help will be appreciated. Thanks!
EDIT:
Some of the columns (like A) may have a mix of 100 and -100. Positive 100 will yield a simple A (see row 1) but a -100 should yield some other name like "neg_A" in the result (see row 3).
Let's sort the columns of the dataframe by the priority values, then use .eq + .idxmax on axis=1 to get the name of the first column containing 100:
# define a dict with col names and priority values
d = {'A': .67, 'B': .70, 'C': .75, 'D': .66, 'E': .54, 'F': .98}
df['result'] = df[sorted(d, key=lambda x: -d[x])].eq(100).idxmax(axis=1)
A B C D E F result
0 100 0 0 0 100 0 A
1 0 100 0 0 0 100 F
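The one-liner above doesn't cover the neg_A case from the edit. One way to handle it, as a sketch rather than a definitive implementation, keeps np.select but ranks every (condition, name) pair by its accuracy, using the 0.68 figure the question gives for -A; since np.select takes the first matching condition, the highest-accuracy match wins:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [100, 0, -100], 'B': [0, 100, 0], 'C': [0, 0, 0],
                   'D': [0, 0, 0], 'E': [100, 0, 100], 'F': [0, 100, 0]})

# Accuracy per outcome; 'neg_A' gets the 0.68 quoted for -A in the question
acc = {'A': .67, 'neg_A': .68, 'B': .70, 'C': .75, 'D': .66, 'E': .54, 'F': .98}

pairs = [(df['A'] == 100, 'A'), (df['A'] == -100, 'neg_A')] + \
        [(df[c] == 100, c) for c in 'BCDEF']
pairs.sort(key=lambda t: -acc[t[1]])  # highest accuracy first

df['result'] = np.select([c for c, _ in pairs], [n for _, n in pairs])
print(df)
#      A    B  C  D    E    F result
# 0  100    0  0  0  100    0      A
# 1    0  100  0  0    0  100      F
# 2 -100    0  0  0  100    0  neg_A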

How to get location of nearest number in DataFrame

I have following DataFrame
0
0 5
1 10
2 15
3 20
I want to get the location of the value nearest to n. For example, if n=7 the nearest number is 5,
and it should then return the location of 5, i.e. [0][0].
Use Series.abs and Series.idxmin:
# Setup
df = pd.DataFrame({0: {0: 5, 1: 10, 2: 15, 3: 20}})
n = 7
(n - df[0]).abs().idxmin()
[out]
0
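If you also want the nearest value itself, not just its label, feed the result of idxmin back into .loc (continuing the setup above):
idx = (n - df[0]).abs().idxmin()
print(idx, df.loc[idx, 0])  # 0 5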
Alternatively, use numpy.argmin to get the position of the closest number:
df[1] = 7                      # broadcast n into a new column
df[2] = (df[1] - df[0]).abs()  # absolute distance to n
print(np.argmin(df[2]))        # position of the smallest distance
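If the frame had several columns and you wanted a (row, column) location, a small sketch (assuming all columns are numeric) can stack the frame first:
import pandas as pd

df = pd.DataFrame({0: [5, 10, 15, 20]})
n = 7
row, col = (df - n).abs().stack().idxmin()
print(row, col)  # 0 0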

Python Pandas - Column - All possible combination

I have a dataset with 2 columns (Name and Score) and I would like to split the Name column into 2 groups: group 1 and group 2. Then I have to calculate sum(score group 1) / sum(score group 2). My goal is to find, over all combinations of names and groups, the smallest sum(score group 1) / sum(score group 2).
df = pd.DataFrame({
    'Name': list('ABCDEF'),
    'Score': [600, 1000, 300, 100, -100, 3000],
}, columns=['Name', 'Score'])
df
Name Score
0 A 600
1 B 1000
2 C 300
3 D 100
4 E -100
5 F 3000
Example of a first iteration:
Group Name Score
0 1 A 600
1 2 B 1000
2 2 C 300
3 2 D 100
4 2 E -100
5 2 F 3000
sum(score group 1) / sum(score group 2) = 0.1395
Example of a second iteration:
Group Name Score
0 1 A 600
1 1 B 1000
2 2 C 300
3 2 D 100
4 2 E -100
5 2 F 3000
sum(score group 1) / sum(score group 2) = 0.4848
And then calculate the score for all combinations and take the smallest sum(score group 1) / sum(score group 2).
I've updated my solution so it is in line with your examples.
Essentially, all possible combinations of your groups can be generated from the binary representations of the numbers in the range 1 to 2**len(df.index) - 1.
You then convert these binary representations to lists of bools (comb_bools) so they can be used to index your dataframe: group1 is selected by comb_bools and group2 by not(comb_bools).
Once you have these lists you can easily calculate the values you require and store them in the list result.
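As a quick illustration of those binary masks, here is what the loop below builds, shown for a toy n = 3:
n = 3
for i in range(1, 2 ** n - 1):
    comb = format(i, '0{}b'.format(n))
    print(i, comb, [c == '1' for c in comb])
# 1 001 [False, False, True]
# 2 010 [False, True, False]
# ...
# 6 110 [True, True, False]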
import pandas as pd
import numpy as np

df = pd.DataFrame({'Name': list('ABCDEF'),
                   'Score': [600, 1000, 300, 100, -100, 3000]},
                  columns=['Name', 'Score'])

nb_combs = 2 ** len(df.index) - 1
group1 = []
group2 = []
result = []
for i in range(1, nb_combs):
    comb = list(map(int, format(i, '0' + str(len(df.index)) + 'b')))
    comb_bools = list(map(bool, comb))
    group1.append(df[comb_bools]['Name'].values)
    group2.append(df[[not j for j in comb_bools]]['Name'].values)
    numerator = sum(df[df['Name'].isin(group1[i - 1])]['Score'].values)
    denominator = sum(df[df['Name'].isin(group2[i - 1])]['Score'].values)
    result.append(numerator / denominator)
min_idx = np.argmin(result)
print('Minimum value: {}'.format(result[min_idx]))
print('Corresponding Group1: {}'.format(group1[min_idx]))
print('Corresponding Group2: {}\n'.format(group2[min_idx]))
Output:
Minimum value: -50.0
Corresponding Group1: ['A' 'B' 'C' 'D' 'F']
Corresponding Group2: ['E']
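The same enumeration can also be written with itertools.combinations instead of binary masks; here is a minimal alternative sketch (not the answerer's code) that also guards against a zero denominator:
from itertools import combinations

import pandas as pd

df = pd.DataFrame({'Name': list('ABCDEF'),
                   'Score': [600, 1000, 300, 100, -100, 3000]})
scores = dict(zip(df['Name'], df['Score']))
total = sum(scores.values())

best = None
for r in range(1, len(scores)):            # group 1 sizes 1 .. n-1
    for g1 in combinations(scores, r):
        s1 = sum(scores[name] for name in g1)
        s2 = total - s1
        if s2 == 0:                        # skip splits where group 2 sums to zero
            continue
        ratio = s1 / s2
        if best is None or ratio < best[0]:
            best = (ratio, g1)

print(best)  # (-50.0, ('A', 'B', 'C', 'D', 'F'))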

Indexing on DataFrame with MultiIndex

I have a large pandas DataFrame that I need to fill.
Here is my code:
from scipy.stats import pearsonr

trains = np.arange(1, 101)
# The above are example values; it's actually 900 integers between 1 and 20000
tresholds = np.arange(10, 70, 10)
tuples = []
for i in trains:
    for j in tresholds:
        tuples.append((i, j))
index = pd.MultiIndex.from_tuples(tuples, names=['trains', 'tresholds'])
df = pd.DataFrame(np.zeros((len(index), len(trains))), index=index, columns=trains, dtype=float)

metrics = dict()
for i in trains:
    m = binary_metric_train(True, i)
    # The above function returns a binary array of length 35
    # Example: [1, 0, 0, 1, ...]
    metrics[i] = m

for i in trains:
    for j in tresholds:
        trA = binary_metric_train(True, i, tresh=j)
        for k in trains:
            if k != i:
                trB = metrics[k]
                corr = abs(pearsonr(trA, trB)[0])
                df[k][i][j] = corr
            else:
                df[k][i][j] = np.nan
My problem is that when this piece of code is finally done computing, my DataFrame df still contains nothing but zeros. Even the NaNs are not inserted. I think that my indexing is correct. Also, I have tested my binary_metric_train function separately; it does return an array of length 35.
Can anyone spot what I am missing here?
EDIT: For clarity, this DataFrame looks like this:
1 2 3 4 5 ...
trains tresholds
1 10
20
30
40
50
60
2 10
20
30
40
50
60
...
As @EdChum noted, you should take a look at pandas indexing. Here's some test data for the purpose of illustration, which should clear things up.
import numpy as np
import pandas as pd

trains     = [ 1,  1,  1,  2,  2,  2]
thresholds = [10, 20, 30, 10, 20, 30]
data       = [ 1,  0,  1,  0,  1,  0]

df = pd.DataFrame({
    'trains': trains,
    'thresholds': thresholds,
    'C1': data,
    'C2': data,
}).set_index(['trains', 'thresholds'])
print(df)

df.ix[(2, 30), 0] = 3      # using the column's position (.ix is deprecated)
# or...
df.ix[(2, 30), 'C1'] = 3   # using the column name
df.loc[(2, 30), 'C1'] = 3  # using the column name
# but not...
df.loc[(2, 30), 1] = 3     # this creates a new column
print(df)
Which outputs the DataFrame before and after modification:
C1 C2
trains thresholds
1 10 1 1
20 0 0
30 1 1
2 10 0 0
20 1 1
30 0 0
C1 C2 1
trains thresholds
1 10 1 1 NaN
20 0 0 NaN
30 1 1 NaN
2 10 0 0 NaN
20 1 1 NaN
30 3 0 3
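Since .ix has been removed from recent pandas, here is a minimal modern equivalent of the same demonstration (my sketch, not the original answer), using .loc for label-based access and .iloc with Index.get_loc for positional column access:
import pandas as pd

df = pd.DataFrame({'trains': [1, 1, 1, 2, 2, 2],
                   'thresholds': [10, 20, 30, 10, 20, 30],
                   'C1': [1, 0, 1, 0, 1, 0],
                   'C2': [1, 0, 1, 0, 1, 0]}).set_index(['trains', 'thresholds'])

df.loc[(2, 30), 'C1'] = 3                  # label-based: (row tuple, column name)
df.iloc[df.index.get_loc((2, 30)), 0] = 3  # position-based row and column access
print(df)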
