Inverse Score in Python

I have a dataframe as follows:
import pandas as pd
df = pd.DataFrame({'value': [54, 74, 71, 78, 12]})
Expected output:
value    score
54       scaled value
74       scaled value
71       scaled value
78       50.000
12       600.00
I want to assign a score between 50 and 600 to each value, but the lowest value must get the highest score. Do you have an idea?

Not sure what you want to achieve; maybe you could provide the exact expected output for this input.
But if I understand correctly, you could try:
import pandas as pd

df = pd.DataFrame({'value': [54, 74, 71, 78, 12]})

# the score range 50..600 spans 550 points
vmin = df['value'].min()
vmax = df['value'].max()
step = 550 / (vmax - vmin)

# the smallest value gets 600, the largest gets 50
df['score'] = 600 - (df['value'] - vmin) * step
print(df)
This will output
   value       score
0     54  250.000000
1     74   83.333333
2     71  108.333333
3     78   50.000000
4     12  600.000000
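In other words, the smallest value is pinned to 600 and the largest to 50, and everything in between is scaled linearly across the 550-point range.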

This is my idea, but I think there is a scale on your scores that is missing from your question.
import numpy as np

dfmin = df['value'].min()
dfmax = df['value'].max()
dfrange = dfmax - dfmin
score_value = (600 - 50) / dfrange

# pin the endpoints to 600 and 50 explicitly, interpolate everything else
df.loc[:, 'score'] = np.where(df['value'] == dfmin, 600,
                     np.where(df['value'] == dfmax, 50,
                              600 - (df['value'] - dfmin) * (1 / score_value)))
df
that produces:
   value   score
0     54  594.96
1     74  592.56
2     71  592.92
3     78   50.00
4     12  600.00
Not matching your output because of the missing scale; note that multiplying by 1/score_value rather than score_value compresses the interior scores toward 600, as seen above.

Calculate the median age for each region from a frequency table with python

I have a dataframe that is similar to:
I would like to calculate the median age for each city, but given that it is a frequency table I'm finding it somewhat tricky. Is there any function in pandas or another library that would help me achieve this?
Maybe this works for you:
import numpy as np
import pandas as pd

# create dataframe
df = pd.DataFrame(
    [
        ['Alabama', 34, 67, 89, 89, 67, 545, 4546, 3, 23],
        ['Georgia', 345, 65, 67, 32, 23, 567, 87, 647, 68]
    ],
    columns=['City', 0, 1, 2, 3, 4, 5, 6, 7, 8]
).set_index('City')
print(df)

# calculate median for freq table
m = list()  # median list
for index, row in df.iterrows():
    v = list()  # value list
    z = zip(row.index, row.values)
    for item in z:
        for f in range(item[1]):
            v.append(item[0])
    m.append(np.median(v))

df_m = pd.DataFrame({'City': df.index, 'Median': m})
print(df_m)
Input:
           0    1   2   3   4    5     6    7   8
City
Alabama   34   67  89  89  67  545  4546    3  23
Georgia  345   65  67  32  23  567    87  647  68
Output:
      City  Median
0  Alabama     6.0
1  Georgia     5.0
For each row, find the total number of people. Then take that number, divide it by 2, and determine which age group that middle person falls into by accumulating the counts until they pass that position.
For example, for the row 'Alabama', you would add 34 + 67 + ... + 23 = 5463. That, divided by 2, would be 2731.5 ==> 2731. Then, checking each age group, determine where the 2731st person would be.
At age 0, since 2731 > 34, check the next.
At age 1, since 2731 > 34 + 67, check the next.
At age 2, since 2731 > 34 + 67 + 89, check the next.
At age 3, since 2731 > 34 + 67 + 89 + 89, check the next.
At age 4, since 2731 > 34 + 67 + 89 + 89 + 67, check the next.
At age 5, since 2731 > 34 + 67 + 89 + 89 + 67 + 545, check the next.
At age 6, since 2731 < 34 + 67 + 89 + 89 + 67 + 545 + 4546, the median has to be in this age group.
Do this repeatedly for each city/state, and you should get the median for each one.
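A vectorized sketch of the same cumulative-count idea, assuming the same df as in the answer above (note: for an even total count this returns the lower of the two middle ages rather than their average):
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ['Alabama', 34, 67, 89, 89, 67, 545, 4546, 3, 23],
        ['Georgia', 345, 65, 67, 32, 23, 567, 87, 647, 68]
    ],
    columns=['City', 0, 1, 2, 3, 4, 5, 6, 7, 8]
).set_index('City')

def freq_median(row):
    # 0-based position of the middle person among all counted people
    middle = (row.sum() - 1) // 2
    # first age whose cumulative count passes the middle position
    return row.index[np.searchsorted(row.cumsum().values, middle, side='right')]

print(df.apply(freq_median, axis=1))
# City
# Alabama    6
# Georgia    5
# dtype: int64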

Loop over arrays in a dataframe with an if statement

I am trying to write code that runs the following block once per column of a dataframe, i.e. four times for four different arrays:
median_alcohol = df.alcohol.median()
for i, alcohol in enumerate(df.alcohol):
    if alcohol >= median_alcohol:
        df.loc[i, 'alcohol'] = 'high'
    else:
        df.loc[i, 'alcohol'] = 'low'
df.groupby('alcohol').quality.mean()
The columns in the dataframe are:
alcohol
pH
residual_sugar
citric_acid
I am trying to come up with a method to capture the four different arrays. Any ideas how I should go about this?
I'm not sure exactly what you're trying to do, but from what I understood, you could try something like this:
import pandas as pd
from statistics import mean

df = pd.DataFrame({'alcohol': [45, 88, 56, 15, 71],
                   'pH': [12, 83, 56, 25, 71],
                   'residual_sugar': [14, 25, 55, 8, 21]})
print(df)
# Output:
#    alcohol  pH  residual_sugar
# 0       45  12              14
# 1       88  83              25
# 2       56  56              55
# 3       15  25               8
# 4       71  71              21

def func(column):
    dftemp = df.copy()
    median_column = df[column].median()
    for i, item in enumerate(df[column]):
        dftemp.loc[i, column] = 'high' if item >= median_column else 'low'
    return dftemp.groupby(column).agg(list).applymap(mean)

different_arrays = [func(c) for c in df.columns]
for array in different_arrays:
    print(array)
Output:
              pH  residual_sugar
alcohol
high        70.0       33.666667
low         18.5       11.000000

         alcohol  residual_sugar
pH
high   71.666667       33.666667
low    30.000000       11.000000

        alcohol    pH
residual_sugar
high  71.666667  70.0
low   30.000000  18.5
def numeric_to_buckets(df, column_name):
    median = df[column_name].median()
    for i, val in enumerate(df[column_name]):
        if val >= median:
            df.loc[i, column_name] = 'high'
        else:
            df.loc[i, column_name] = 'low'

for feature in df.columns[:-1]:
    numeric_to_buckets(df, feature)
    print(df.groupby(feature).quality.mean(), '\n')
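For what it's worth, a loop-free sketch of the same bucketing, assuming df holds numeric feature columns plus a final quality column as in the question:
import numpy as np

for feature in df.columns[:-1]:
    median = df[feature].median()
    # replace each numeric value with a 'high'/'low' label in one vectorized step
    df[feature] = np.where(df[feature] >= median, 'high', 'low')
    print(df.groupby(feature).quality.mean(), '\n')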

Pandas hierarchical indexes and calculations

Given:
df = pd.DataFrame({"panum": ["PA1", "PA1", "PA1", "PA2", "PA2", "PA2"],
                   "which": ["A", "A", "A", "B", "B", "B"],
                   "score": [88, 80, 90, 92, 95, 99]})
df.set_index(['panum', 'which'], inplace=True)
df

             score
panum which
PA1   A         88
      A         80
      A         90
PA2   B         92
      B         95
      B         99
Is it possible to write something that would create a new entry in the 'which' index level, called max, holding the max for that group? It would create two new rows, (PA1, max) and (PA2, max).
Update
I have corrected the indexes. The example above is not what I meant.
panum  factor  score
PA1    init       90
       resub      94
       final      93
PA2    init       60
       resub      90
       final      88
And my question, in this better scenario, would be: "I want to create a new 'panum' called mean, which would have three rows: (mean, init), (mean, resub), (mean, final)."
Pseudocode would be something like df['mean'] = (df['pa1'] + df['pa2']) / 2
I know this is a different question!
You can create a new DataFrame of max values, add a second index level max, append it to the original and finally sort_index:
m = df.max(level=0).assign(max='max').set_index('max', append=True)
print (m)
           score
panum max
PA1   max     90
PA2   max     99
df = df.append(m).sort_index()
print (df)
             score
panum which
PA1   A         88
      A         80
      A         90
      max       90
PA2   B         92
      B         95
      B         99
      max       99
EDIT: the solution is changed to take the mean by the second level, with swaplevel so the result aligns correctly with the final DataFrame:
df = pd.DataFrame({"panum": ["PA1", "PA1", "PA1", "PA2", "PA2", "PA2"],
                   "factor": ["init", "resub", "final"] * 2,
                   "score": [90, 94, 93, 60, 90, 88]})
df.set_index(['panum', 'factor'], inplace=True)
print (df)
              score
panum factor
PA1   init       90
      resub      94
      final      93
PA2   init       60
      resub      90
      final      88
m = (df.mean(level=1)
       .assign(factor='mean')
       .set_index('factor', append=True)
       .swaplevel(0, 1))
print (m)
               score
factor factor
mean   init     75.0
       resub    92.0
       final    90.5
df = df.append(m)
print (df)
              score
panum factor
PA1   init     90.0
      resub    94.0
      final    93.0
PA2   init     60.0
      resub    90.0
      final    88.0
mean  init     75.0
      resub    92.0
      final    90.5
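On recent pandas versions (2.x), where DataFrame.append and the level argument of mean have been removed, a groupby-based sketch of the same mean scenario would be:
# mean per factor across all panums, labelled as panum 'mean'
m = df.groupby(level='factor', sort=False).mean()
m.index = pd.MultiIndex.from_arrays([['mean'] * len(m), m.index],
                                    names=['panum', 'factor'])
df = pd.concat([df, m])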
Append a max as we go with pd.concat
pd.concat([
    d.append(d.max().rename((n, 'max')))
    for n, d in df.groupby('panum')
])
             score
panum which
PA1   A         88
      A         80
      A         90
      max       90
PA2   B         92
      B         95
      B         99
      max       99

Pandas DataFrame: Complex linear interpolation

I have a dataframe with 4 sections:
Section 1: Product details
Section 2: 6 Potential product values based on a range of simulations
Section 3: Upper and lower bound for the input parameter to the simulations
Section 4: Randomly generated values for the input parameters
Section 2 is generated by pricing the product at equal intervals between the upper and lower bound.
I need to take the values in Section 4 and figure out the corresponding product value. Here is a possible setup for this dataframe:
table2 = pd.DataFrame({
    'Product Type': ['A', 'B', 'C', 'D'],
    'State_1_Value': [10, 11, 12, 13],
    'State_2_Value': [20, 21, 22, 23],
    'State_3_Value': [30, 31, 32, 33],
    'State_4_Value': [40, 41, 42, 43],
    'State_5_Value': [50, 51, 52, 53],
    'State_6_Value': [60, 61, 62, 63],
    'Lower_Bound': [-1, 1, .5, 5],
    'Upper_Bound': [1, 2, .625, 15],
    'sim_1': [0, 0, .61, 7],
    'sim_2': [1, 1.5, .7, 9],
})
>>> table2
   Lower_Bound Product Type  State_1_Value  State_2_Value  State_3_Value  \
0         -1.0            A             10             20             30
1          1.0            B             11             21             31
2          0.5            C             12             22             32
3          5.0            D             13             23             33

   State_4_Value  State_5_Value  State_6_Value  Upper_Bound  sim_1  sim_2
0             40             50             60        1.000   0.00    1.0
1             41             51             61        2.000   0.00    1.5
2             42             52             62        0.625   0.61    0.7
3             43             53             63       15.000   7.00    9.0
I will run through a couple of examples of this calculation to make it clear what my question is.
Product A - sim_2
The input here is 1.0. This is equal to the upper bound for this product, therefore the simulation value is equivalent to the State_6 value: 60.
Product B - sim_2
The input here is 1.5. The LB to UB range is (1, 2), therefore the 6 states are {1, 1.2, 1.4, 1.6, 1.8, 2}. 1.5 is exactly in the middle of state 3, which has a value of 31, and state 4, which has a value of 41. Therefore the simulation value is 36.
Product C - sim_1
The input here is .61. The LB to UB range is (.5, .625), therefore the 6 states are {.5, .525, .55, .575, .6, .625}. .61 is between states 5 and 6. Specifically, the bucket it falls under is 5*(.61-.5)/(.625-.5) + 1 = 5.4 (it is multiplied by 5 as that is the number of intervals; you can calculate it other ways and get the same result). Then, to calculate the value, we use that bucket in a weighting of the values for state 5 and state 6: (62-52)*(5.4-5) + 52 = 56.
Product B - sim_1
The input here is 0, which is below the lower bound of 1, therefore we need to extrapolate the value. We use the same formula as above, just with the values of state 1 and state 2. The bucket would be 5*(0-1)/(2-1) + 1 = -4. The two values used are 11 and 21, so the value is (21-11)*(-4-1) + 11 = -39.
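A small scalar sketch of this bucket logic (interp is a hypothetical helper, shown only to confirm the worked examples above):
def interp(sim, lb, ub, states):
    # bucket number runs from 1 (at LB) to 6 (at UB); values outside that range extrapolate
    bucket = 5 * (sim - lb) / (ub - lb) + 1
    lv = min(max(int(bucket), 1), 5)  # 1-based index of nLower, clamped
    hv = lv + 1                       # 1-based index of nHigher
    return (states[hv - 1] - states[lv - 1]) * (bucket - lv) + states[lv - 1]

print(interp(1.5, 1, 2, [11, 21, 31, 41, 51, 61]))      # 36.0  (Product B - sim_2)
print(interp(.61, .5, .625, [12, 22, 32, 42, 52, 62]))  # ~56.0 (Product C - sim_1)
print(interp(0, 1, 2, [11, 21, 31, 41, 51, 61]))        # -39.0 (Product B - sim_1)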
I've also simplified the problem to try to visualize the solution; my final code needs to run on 500 values and 10,000 simulations, and the dataframe will have about 200 rows.
Here are the formulas I've used for the interpolation, although I'm not committed to them specifically.
Bucket = N*(sim_value - LB)/(UB - LB) + 1
where N is the number of intervals;
then nLower is the state value directly below the bucket, and nHigher is the state value directly above the bucket. If the bucket is outside the LB/UB range, force nLower and nHigher to be either the first two or the last two values.
Final_value = (nHigher - nLower)*(Bucket - bucket_number_of_nLower) + nLower
To summarize, my question is how I can generate the final results based on the combination of input data provided. The most challenging part to me is how to make the connection from the Bucket number to the nLower and nHigher values.
I was able to generate the result using the following code. I'm not sure of the memory implications on a large dataframe, so I'm still interested in better answers or improvements.
Edit: I ran this code on the full dataset (141 rows, 500 intervals, 10,000 simulations) and it took slightly over 1.5 hours. So it's not quite as useless as I assumed, but there is probably a smarter way of doing this in a tiny fraction of that time.
for i in range(1, 3):
    # fractional bucket number for each simulation column
    table2['Bucket%s' % i] = 5 * (table2['sim_%s' % i] - table2['Lower_Bound']) / (table2['Upper_Bound'] - table2['Lower_Bound']) + 1
    # integer state indices directly below/above the bucket
    table2['lv'] = table2['Bucket%s' % i].map(int)
    table2['hv'] = table2['Bucket%s' % i].map(int) + 1
    # clamp out-of-range buckets to the first two or last two states
    table2.loc[table2['lv'] < 1, 'lv'] = 1
    table2.loc[table2['lv'] > 5, 'lv'] = 5
    table2.loc[table2['hv'] > 6, 'hv'] = 6
    table2.loc[table2['hv'] < 2, 'hv'] = 2
    table2['nLower'] = table2.apply(lambda row: row['State_%s_Value' % row['lv']], axis=1)
    table2['nHigher'] = table2.apply(lambda row: row['State_%s_Value' % row['hv']], axis=1)
    table2['Final_value_%s' % i] = (table2['nHigher'] - table2['nLower']) * (table2['Bucket%s' % i] - table2['lv']) + table2['nLower']
Output:
>>> table2
   Lower_Bound Product Type  State_1_Value  State_2_Value  State_3_Value  \
0         -1.0            A             10             20             30
1          1.0            B             11             21             31
2          0.5            C             12             22             32
3          5.0            D             13             23             33

   State_4_Value  State_5_Value  State_6_Value  Upper_Bound  sim_1  sim_2  \
0             40             50             60        1.000   0.00    1.0
1             41             51             61        2.000   0.00    1.5
2             42             52             62        0.625   0.61    0.7
3             43             53             63       15.000   7.00    9.0

   Bucket1  lv  hv  nLower  nHigher  Final_value_1  Bucket2  Final_value_2
0      3.5   5   6      50       60           35.0      6.0           60.0
1     -4.0   3   4      31       41          -39.0      3.5           36.0
2      5.4   5   6      52       62           56.0      9.0           92.0
3      2.0   3   4      33       43           23.0      3.0           33.0
I posted a superior solution with no loops here:
Alternate method to avoid loop in pandas dataframe
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Product Type': ['A', 'B', 'C', 'D'],
    'State_1_Value': [10, 11, 12, 13],
    'State_2_Value': [20, 21, 22, 23],
    'State_3_Value': [30, 31, 32, 33],
    'State_4_Value': [40, 41, 42, 43],
    'State_5_Value': [50, 51, 52, 53],
    'State_6_Value': [60, 61, 62, 63],
    'Lower_Bound': [-1, 1, .5, 5],
    'Upper_Bound': [1, 2, .625, 15],
    'sim_1': [0, 0, .61, 7],
    'sim_2': [1, 1.5, .7, 9],
})

# fractional bucket numbers for the two sim columns
buckets = df.iloc[:, -2:].sub(df['Lower_Bound'], axis=0).div(df['Upper_Bound'].sub(df['Lower_Bound']), axis=0) * 5 + 1
low = buckets.applymap(int)
high = buckets.applymap(int) + 1
# clamp out-of-range buckets to the first two or last two states
low = low.applymap(lambda x: 1 if x < 1 else x)
low = low.applymap(lambda x: 5 if x > 5 else x)
high = high.applymap(lambda x: 6 if x > 6 else x)
high = high.applymap(lambda x: 2 if x < 2 else x)
# the 'State|Type' filter puts Product Type at position 0, so state k sits at position k
low_value = pd.DataFrame(df.filter(regex="State|Type").values[np.arange(low.shape[0])[:, None], low])
high_value = pd.DataFrame(df.filter(regex="State|Type").values[np.arange(high.shape[0])[:, None], high])
df1 = (high_value - low_value).mul((buckets - low).values) + low_value
df1['Product Type'] = df['Product Type']
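For the sample data this reproduces the loop version's Final_value_1 and Final_value_2 columns: 35/60, -39/36, 56/92 and 23/33 for products A through D.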

Removing rows below first line that meets threshold in pandas dataframe

I have a df that looks like:
import pandas as pd
import numpy as np

d = {'Hours': np.arange(12, 97, 12),
     'Average': np.random.random(8),
     'Count': [500, 250, 125, 75, 60, 25, 5, 15]}
df = pd.DataFrame(d)
This df has a decreasing number of cases in each row. Once the count drops below a certain threshold, I'd like to drop that row and everything after it, for example once a < 10 case threshold is reached.
Starting:
    Average  Count  Hours
0  0.560671    500     12
1  0.743811    250     24
2  0.953704    125     36
3  0.313850     75     48
4  0.640588     60     60
5  0.591149     25     72
6  0.302894      5     84
7  0.418912     15     96
Finished (row 6 onward removed):
    Average  Count  Hours
0  0.560671    500     12
1  0.743811    250     24
2  0.953704    125     36
3  0.313850     75     48
4  0.640588     60     60
5  0.591149     25     72
We can use the index generated from the boolean mask and slice the df using iloc:
In [58]:
df.iloc[:df[df.Count < 10].index[0]]
Out[58]:
    Average  Count  Hours
0  0.183016    500     12
1  0.046221    250     24
2  0.687945    125     36
3  0.387634     75     48
4  0.167491     60     60
5  0.660325     25     72
Just to break down what is happening here
In [54]:
# use a boolean mask to index into the df
df[df.Count < 10]
Out[54]:
    Average  Count  Hours
6  0.244839      5     84
In [56]:
# we want the index and can subscript the first element using [0]
df[df.Count < 10].index
Out[56]:
Int64Index([6], dtype='int64')
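Note that index[0] raises an IndexError when no row falls below the threshold. A guarded sketch, assuming the default RangeIndex:
mask = df.Count < 10
trimmed = df.iloc[:mask.idxmax()] if mask.any() else df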
