I have a dataframe as shown below
ID raw_val var_name constant s_value
1 388 Qty 0.36 -0.032
2 120 Qty 0.36 -0.007
3 34 Qty 0.36 0.16
4 45 Qty 0.36 0.31
1 110 F1 0.36 -0.232
2 1000 F1 0.36 -0.17
3 318 F1 0.36 0.26
4 419 F1 0.36 0.31
My objective is to
a) Find the upper and lower limits (of raw_val) for each value of var_name for s_value >=0
b) Find the upper and lower limits (of raw_val) for each value of var_name for s_value <0
I tried the below
df['sign'] = np.where[df['s_value']<0, 'neg', 'pos']
s = df.groupby(['var_name','sign'])['raw_val'].series
df['buckets'] = pd.IntervalIndex.from_arrays(s)
Please note that my real data is large and has more than 200 unique values in the var_name column. The distribution of positive and negative values (s_value) may be uneven for each value of var_name. In the sample df I have shown an even distribution of pos and neg values, but that may not be the case in real life.
I expect my output to be as below
var_name sign low_limit upp_limit
Qty neg 120 388
F1 neg 110 1000
Qty pos 34 45
F1 pos 318 419
You can use numpy.where to create the sign column and then aggregate the minimal and maximal values:
df['sign'] = np.where(df['s_value']<0, 'neg', 'pos')
df1 = (df.groupby(['var_name','sign'], sort=False, as_index=False)
         .agg(low_limit=('raw_val','min'), upp_limit=('raw_val','max')))
print (df1)
var_name sign low_limit upp_limit
0 Qty neg 120 388
1 Qty pos 34 45
2 F1 neg 110 1000
3 F1 pos 318 419
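Since the positive/negative split can be uneven per var_name (as you note), the groupby simply omits any (var_name, sign) pair that has no rows. If you need every pair present, a minimal sketch (continuing from the code above and assuming the same column names) could reindex against the full combination of groups:
# hedged sketch: guarantee a row for every (var_name, sign) pair, with NaN
# limits where a group has no rows
full_index = pd.MultiIndex.from_product(
    [df['var_name'].unique(), ['neg', 'pos']], names=['var_name', 'sign'])

df1 = (df.groupby(['var_name', 'sign'])['raw_val']
         .agg(low_limit='min', upp_limit='max')
         .reindex(full_index)
         .reset_index())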
Hi, I want to do a lookup to get the factor value for my dataset based on 3 conditions. Below is the lookup table:
Lookup_Table = {'State_Cd': ['TX','TX','TX','TX','CA','CA','CA','CA'],
                'Deductible': [0,0,1000,1000,0,0,1000,1000],
                'Revenue_1': [-99999999,25000000,-99999999,25000000,-99999999,25000000,-99999999,25000000],
                'Revenue_2': [24999999,99000000,24999999,99000000,24999999,99000000,24999999,99000000],
                'Factor': [0.15,0.25,0.2,0.3,0.11,0.15,0.13,0.45]
                }
Lookup_Table = pd.DataFrame(Lookup_Table, columns = ['State_Cd','Deductible','Revenue_1','Revenue_2','Factor'])
lookup output:
Lookup_Table
State_Cd Deductible Revenue_1 Revenue_2 Factor
0 TX 0 -99999999 24999999 0.15
1 TX 0 25000000 99000000 0.25
2 TX 1000 -99999999 24999999 0.20
3 TX 1000 25000000 99000000 0.30
4 CA 0 -99999999 24999999 0.11
5 CA 0 25000000 99000000 0.15
6 CA 1000 -99999999 24999999 0.13
7 CA 1000 25000000 99000000 0.45
And then below is my dataset.
Dataset = {'Policy': ['A','B','C'],
           'State': ['CA','TX','TX'],
           'Deductible': [0,1000,0],
           'Revenue': [1500000,30000000,1000000]
           }
Dataset = pd.DataFrame(Dataset, columns = ['Policy','State','Deductible','Revenue'])
Dataset output:
Dataset
Policy State Deductible Revenue
0 A CA 0 1500000
1 B TX 1000 30000000
2 C TX 0 1000000
So basically, to do the lookup, State must match State_Cd in the lookup table, Deductible must match Deductible in the lookup table, and Revenue must fall between Revenue_1 and Revenue_2 (Revenue_1 <= Revenue <= Revenue_2) in order to get the desired Factor value. Below is my expected output:
Policy State Deductible Revenue Factor
0 A CA 0 1500000 0.11
1 B TX 1000 30000000 0.30
2 C TX 0 1000000 0.15
I'm trying conditional_join from the janitor package. However, I'm getting an error. Is something missing in my code?
import janitor

Data_Final = (Dataset.conditional_join(
                  Lookup_Table,
                  # variable arguments
                  # col_from_left_df, col_from_right_df, comparator
                  ('Revenue', 'Revenue_1', '>='),
                  ('Revenue', 'Revenue_2', '<='),
                  ('State', 'State_Cd', '=='),
                  ('Deductible', 'Deductible', '=='),
                  how='left', sort_by_appearance=False
              ))
Below is the error
TypeError: __init__() got an unexpected keyword argument 'copy'
Resolved by installing an older version of pandas (less than 1.5), e.g.:
pip install pandas==1.4
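If pinning pandas isn't an option, the same lookup can be sketched with plain pandas: merge on the two equality keys, then keep the rows whose Revenue falls inside the [Revenue_1, Revenue_2] range. A minimal sketch, assuming the frames and column names shown above:
# hedged sketch without janitor: equality merge, then a range filter
merged = Dataset.merge(
    Lookup_Table.rename(columns={'State_Cd': 'State'}),
    on=['State', 'Deductible'], how='left')

in_range = merged['Revenue'].between(merged['Revenue_1'], merged['Revenue_2'])
Data_Final = (merged.loc[in_range, ['Policy', 'State', 'Deductible', 'Revenue', 'Factor']]
                    .reset_index(drop=True))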
I have the data like this:
df:
A-A A-B A-C A-D A-E
Tg 0.37 10.24 5.02 0.63 20.30
USL 0.39 10.26 5.04 0.65 20.32
LSL 0.35 10.22 5.00 0.63 20.28
1 0.35 10.23 5.05 0.65 20.45
2 0.36 10.19 5.07 0.67 20.25
3 0.34 10.25 5.03 0.66 20.33
4 0.35 10.20 5.08 0.69 20.22
5 0.33 10.17 5.05 0.62 20.40
Max 0.36 10.25 5.08 0.69 20.45
Min 0.33 10.17 5.03 0.62 20.22
I would like to color-highlight the data (index 1-5 in this df) by comparing the Max and Min of the data (last two rows) to USL and LSL respectively. If Max > USL or Min < LSL, I would like to highlight the corresponding data points red; if Max == USL or Min == LSL, the corresponding data points yellow; otherwise everything green.
I tried this:
highlight = np.where(df.loc['Max']>df.loc['USL'], 'background-color: red', '')
df.style.apply(lambda _: highlight)
but I get the error:
ValueError: Function <function <lambda> at 0x7fb681b601f0> created invalid index labels.
Usually, this is the result of the function returning a Series which contains invalid labels, or returning an incorrectly shaped, list-like object which cannot be mapped to labels, possibly due to applying the function along the wrong axis.
Result index has shape: (5,)
Expected index shape: (10,)
Out[58]:
<pandas.io.formats.style.Styler at 0x7fb681b52e20>
Use a custom function to create a DataFrame of styles based on the conditions:
#changed data for test
print (df)
A-A A-B A-C A-D
Tg 0.37 10.24 5.02 0.63
USL 0.39 10.26 5.04 0.65
LSL 0.33 0.22 5.00 10.63
1 0.35 10.23 5.05 0.65
2 0.36 10.19 5.07 0.67
3 0.34 10.25 5.03 0.66
4 0.35 10.20 5.08 0.69
5 0.33 10.17 5.05 0.62
Max 0.36 10.25 5.08 0.69
Min 0.33 10.17 5.03 0.62
def highlight(x):
    c1 = 'background-color:red'
    c2 = 'background-color:yellow'
    c3 = 'background-color:green'
    # if the values of the index are strings
    r = list('12345')
    # if the values of the index are integers
    # r = [1, 2, 3, 4, 5]

    m1 = (x.loc['Max'] > x.loc['USL']) | (x.loc['Min'] < x.loc['LSL'])
    m2 = (x.loc['Max'] == x.loc['USL']) | (x.loc['Min'] == x.loc['LSL'])

    # DataFrame with the same index and column names as the original, filled with empty strings
    df1 = pd.DataFrame('', index=x.index, columns=x.columns)
    # set the styles of rows r by the boolean masks
    df1.loc[r, :] = np.select([m1, m2], [c1, c2], default=c3)
    return df1

df.style.apply(highlight, axis=None)
EDIT: To compare rows 1-5 and Min/Max, use:
def highlight(x):
    c1 = 'background-color:red'
    c2 = 'background-color:yellow'
    c3 = 'background-color:green'
    # if the values of the index are strings
    r = list('12345')
    # if the values of the index are integers
    # r = [1, 2, 3, 4, 5]
    r += ['Max', 'Min']

    m1 = (x.loc[r] > x.loc['USL']) | (x.loc[r] < x.loc['LSL'])
    m2 = (x.loc[r] == x.loc['USL']) | (x.loc[r] == x.loc['LSL'])

    # DataFrame with the same index and column names as the original, filled with empty strings
    df1 = pd.DataFrame('', index=x.index, columns=x.columns)
    # set the styles of rows r by the boolean masks
    df1.loc[r, :] = np.select([m1, m2], [c1, c2], default=c3)
    return df1

df.style.apply(highlight, axis=None)
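A quick usage note: the Styler only shows its colours when rendered (e.g. in a notebook). If you need the highlighting in a file, you can export it; a small sketch, assuming a reasonably recent pandas (Styler.to_html needs 1.3+) and openpyxl installed for the Excel route:
styled = df.style.apply(highlight, axis=None)

# write the coloured table to Excel (requires openpyxl)
styled.to_excel('highlighted.xlsx', engine='openpyxl')

# or render it as standalone HTML
with open('highlighted.html', 'w') as fh:
    fh.write(styled.to_html())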
I am currently working with the following data.
import pandas as pd
import io
csv_data = '''
gender,age,Cr,eGFR
1,76,0.56,60.7
1,50,0.76, 70.6
2,64,0.62,55.9
1,62,0.45,Nan
1,68,0.88,80.2
2,69,0.65,Nan
1,70,0.64,62.8
2,65,0.39,60.2
'''
df = pd.read_csv(io.StringIO(csv_data))
gender = 1 is male and 2 is female.
This time, there are two missing values.
If it is a male:
eGFR = 194 × Cr - 1.094 × age - 0.287
If female:
eGFR = 194 × Cr - 1.094 × age - 0.287 × 0.739
I want to fill in the missing values as indicated above.
Use Series.fillna with a Series created by Series.where from s1 and s2:
#if necessary
df = df.replace('Nan', pd.NA)
s1 = 194 * df.Cr - 1.094 * df.age - 0.287
s2 = 194 * df.Cr - 1.094 * df.age - 0.287 * 0.739
df['eGFR'] = df['eGFR'].fillna(s1.where(df['gender'].eq(1), s2))
print (df)
gender age Cr eGFR
0 1 76 0.56 60.700000
1 1 50 0.76 70.600000
2 2 64 0.62 55.900000
3 1 62 0.45 19.185000
4 1 68 0.88 80.200000
5 2 69 0.65 50.401907
6 1 70 0.64 62.800000
7 2 65 0.39 60.200000
A few notes: gender values are 1 and 2; I assumed 1 = male and 2 = female. There are 'Nan' strings, which are not NaN values.
# the factor 0.287 is multiplied by 1 for males and by 0.739 for females
df['eGFR'] = (df.eGFR.replace('Nan', pd.NA)
                .fillna(194 * df.Cr - 1.094 * df.age
                        - 0.287 * df.gender.where(lambda x: x == 1, 0.739))
                .astype(float))
df
One option is case_when from pyjanitor, which emulates SQL's case_when, or chained if/else logic in Python.
Do note that this is just an option; since the task is filling nulls, the idiomatic route would be a fillna and should be faster.
# pip install pyjanitor
import pandas as pd
import janitor
df = df.transform(pd.to_numeric, errors = 'coerce')
df.case_when(
    # condition, result
    df.eGFR.isna() & df.gender.eq(1), 194 * df.Cr - 1.094 * df.age - 0.287,
    df.eGFR.isna() & df.gender.eq(2), 194 * df.Cr - 1.094 * df.age - 0.287 * 0.739,
    df.eGFR,  # default
    column_name='eGFR')
gender age Cr eGFR
0 1 76 0.56 60.700000
1 1 50 0.76 70.600000
2 2 64 0.62 55.900000
3 1 62 0.45 19.185000
4 1 68 0.88 80.200000
5 2 69 0.65 50.401907
6 1 70 0.64 62.800000
7 2 65 0.39 60.200000
You can use strings for the conditions, as long as they can be evaluated by pd.eval. Note that pd.eval may be slower than regular bracket access:
df.case_when(
    # condition, result
    'eGFR.isna() and gender==1', 194 * df.Cr - 1.094 * df.age - 0.287,
    'eGFR.isna() and gender==2', 194 * df.Cr - 1.094 * df.age - 0.287 * 0.739,
    df.eGFR,  # default
    column_name='eGFR')
gender age Cr eGFR
0 1 76 0.56 60.700000
1 1 50 0.76 70.600000
2 2 64 0.62 55.900000
3 1 62 0.45 19.185000
4 1 68 0.88 80.200000
5 2 69 0.65 50.401907
6 1 70 0.64 62.800000
7 2 65 0.39 60.200000
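As a side note, recent pandas versions (2.2+) ship a built-in Series.case_when, so the same fill can be written without pyjanitor. A small sketch under that assumption, where the calling Series supplies the default:
# hedged sketch, pandas >= 2.2: rows matching none of the conditions keep
# their original eGFR value
df['eGFR'] = df['eGFR'].case_when([
    (df.eGFR.isna() & df.gender.eq(1), 194 * df.Cr - 1.094 * df.age - 0.287),
    (df.eGFR.isna() & df.gender.eq(2), 194 * df.Cr - 1.094 * df.age - 0.287 * 0.739),
])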
I'm working on this dataset with the following columns, N/A counts and example of a record:
Serial No. 0
GRE Score 0
TOEFL Score 0
University Rating 0
SOP 0
LOR 0
CGPA 0
Research 0
Chance of Admit 0
dtype: int64
0: 1 337 118 4 4.5 4.5 9.65 1 0.92
1: 2 324 107 4 4.0 4.5 8.87 1 0.76
The column Chance of Admit is a normalised numeric value ranging from 0 to 1. What I want to do is take this column and output corresponding ordered values where the chance falls into bins such as (low, medium, high) or (unlikely, doable, likely), etc.
What I have come across is that pandas has a built-in function named to_categorical; however, I don't understand it well enough, and what I've read I still don't exactly get.
This dataset would be used for a decision tree where the labels would be the chance of admit.
Thank you for your help
Since they are "normalized" values...why would you need to categorize them? A simple threshold should work, right?
i.e.
0-0.33 low
0.33-0.66 medium
0.66-1.0 high
The only reason you would want to use an automated method would probably be if your number of categories keeps changing?
To create the categories, you can use pandas cut, but you will need to determine the range and the number of bins (categories). From the docs, something like this should work:
In [6]: df = pd.DataFrame({'value': np.random.randint(0, 100, 20)})
In [7]: labels = ["{0} - {1}".format(i, i + 9) for i in range(0, 100, 10)]
In [8]: df['group'] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)
In [9]: df.head(10)
Out[9]:
value group
0 65 60 - 69
1 49 40 - 49
2 56 50 - 59
3 43 40 - 49
4 43 40 - 49
5 91 90 - 99
6 32 30 - 39
7 87 80 - 89
8 36 30 - 39
9 8 0 - 9
You can then apply the same pattern to your Chance of Admit column, filling in the ranges for your discrete bins either with explicit thresholds or automatically based on a number of bins (see the sketch below).
For your reference:
https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html
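For the "automatically based on a number of bins" option mentioned above, a small sketch (assuming the question's Chance of Admit column) could let pandas pick the edges itself, either equal-width with cut or quantile-based with qcut:
# equal-width bins: pandas splits the observed range into 3 intervals
df['group'] = pd.cut(df['Chance of Admit'], bins=3,
                     labels=['low', 'medium', 'high'])

# quantile-based bins: each category gets roughly the same number of rows
df['group'] = pd.qcut(df['Chance of Admit'], q=3,
                      labels=['low', 'medium', 'high'])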
IIUC, you want to map a continuous variable to a categorical value based on ranges, for example:
0.96 -> high,
0.31 -> low
...
So pandas provides a function for just that, cut. From the documentation:
Use cut when you need to segment and sort data values into bins. This
function is also useful for going from a continuous variable to a
categorical variable.
Setup
Serial No. GRE Score TOEFL Score ... CGPA Research Chance of Admit
0 1 337 118 ... 9.65 1 0.92
1 2 324 107 ... 8.87 1 0.76
2 2 324 107 ... 8.87 1 0.31
3 2 324 107 ... 8.87 1 0.45
[4 rows x 9 columns]
Assuming the above setup, you could use cut like this:
labels = pd.cut(df['Chance of Admit'], [0, 0.33, 0.66, 1.0], labels=['low', 'medium', 'high'])
print(labels)
Output
0 high
1 high
2 low
3 medium
Name: Chance of Admit, dtype: category
Categories (3, object): [low < medium < high]
Notice that we use 3 bins: [(0, 0.33], (0.33, 0.66], (0.66, 1.0]] and that the values of the column Chance of Admit are [0.92, 0.76, 0.31, 0.45]. If you want to change the label names, just change the value of the labels parameter, for example labels=['unlikely', 'doable', 'likely']. If you need an ordinal value, do:
labels = pd.cut(df['Chance of Admit'], [0, 0.33, 0.66, 1.0], labels=list(range(3)))
print(labels)
Output
0 2
1 2
2 0
3 1
Name: Chance of Admit, dtype: category
Categories (3, int64): [0 < 1 < 2]
Finally to put all in perspective you could do the following to add it to your DataFrame:
df['group'] = pd.cut(df['Chance of Admit'], [0, 0.33, 0.66, 1.0], labels=['low', 'medium', 'high'])
print(df)
Output
Serial No. GRE Score TOEFL Score ... Research Chance of Admit group
0 1 337 118 ... 1 0.92 high
1 2 324 107 ... 1 0.76 high
2 2 324 107 ... 1 0.31 low
3 2 324 107 ... 1 0.45 medium
[4 rows x 10 columns]
I was wondering how I would find estimated values based on several different categories. Two of the columns are categorical, one of the other columns contains two strings of interest, and the last contains numeric values.
I have a csv file called sports.csv
import pandas as pd
import numpy as np
#loading the data into data frame
df = pd.read_csv('sports.csv')
I'm trying to find a suggested price for a Gym that has both Baseball and Basketball, as well as enrollment from 240 to 260, given it is from Region 4 and of Type 1.
Region Type enroll estimates price Gym
2 1 377 0.43 40 Football|Baseball|Hockey|Running|Basketball|Swimming|Cycling|Volleyball|Tennis|Ballet
4 2 100 0.26 37 Baseball|Tennis
4 1 347 0.65 61 Basketball|Baseball|Ballet
4 1 264 0.17 12 Swimming|Ballet|Cycling|Basketball|Volleyball|Hockey|Running|Tennis|Baseball|Football
1 1 286 0.74 78 Swimming|Basketball
0 1 210 0.13 29 Baseball|Tennis|Ballet|Cycling|Basketball|Football|Volleyball|Swimming
0 1 263 0.91 31 Tennis
2 2 271 0.39 54 Tennis|Football|Ballet|Cycling|Running|Swimming|Baseball|Basketball|Volleyball
3 3 247 0.51 33 Baseball|Hockey|Swimming|Cycling
0 1 109 0.12 17 Football|Hockey|Volleyball
I don't know how to piece everything together. I apologize if the syntax is incorrect; I'm just beginning Python. So far I have:
import pandas as pd
import numpy as np
#loading the data into data frame
df = pd.read_csv('sports.csv')
#group 4th region and type 1 together where enrollment is in between 240 and 260
group = df[df['Region'] == 4] df[df['Type'] == 1] df[240>=df['Enrollment'] <=260 ]
#split by pipe chars to find gyms that contain both Baseball and Basketball
df['Gym'] = df['Gym'].str.split('|')
df['Gym'] = df['Gym'].str.contains('Baseball'& 'Basketball')
price = df.loc[df['Gym'], 'Price']
Should I do a groupby instead? If so, how would I include the conditions Type == 1, Region == 4, and enrollment from 240 to 260?
You can create a mask with all your conditions specified and then use the mask for subsetting:
mask = (df['Region'] == 4) & (df['Type'] == 1) & \
       (df['enroll'] <= 260) & (df['enroll'] >= 240) & \
       df['Gym'].str.contains('Baseball') & df['Gym'].str.contains('Basketball')
df['price'][mask]
# Series([], name: price, dtype: int64)
which returns empty, since there is no record satisfying all conditions as above.
I had to add an instance that would actually meet your criteria, or else you will get an empty result. You want to use df.loc with conditions as follows:
In [1]: import pandas as pd, numpy as np, io
In [2]: in_string = io.StringIO("""Region Type enroll estimates price Gym
...: 2 1 377 0.43 40 Football|Baseball|Hockey|Running|Basketball|Swimming|Cycling|Volleyball|Tennis|Ballet
...: 4 2 100 0.26 37 Baseball|Tennis
...: 4 1 247 0.65 61 Basketball|Baseball|Ballet
...: 4 1 264 0.17 12 Swimming|Ballet|Cycling|Basketball|Volleyball|Hockey|Running|Tennis|Baseball|Football
...: 1 1 286 0.74 78 Swimming|Basketball
...: 0 1 210 0.13 29 Baseball|Tennis|Ballet|Cycling|Basketball|Football|Volleyball|Swimming
...: 0 1 263 0.91 31 Tennis
...: 2 2 271 0.39 54 Tennis|Football|Ballet|Cycling|Running|Swimming|Baseball|Basketball|Volleyball
...: 3 3 247 0.51 33 Baseball|Hockey|Swimming|Cycling
...: 0 1 109 0.12 17 Football|Hockey|Volleyball""")
In [3]: df = pd.read_csv(in_string,delimiter=r"\s+")
In [4]: df.loc[df.Gym.str.contains(r"(?=.*Baseball)(?=.*Basketball)")
...: & (df.enroll <= 260) & (df.enroll >= 240)
...: & (df.Region == 4) & (df.Type == 1), 'price']
Out[4]:
2 61
Name: price, dtype: int64
Note I used a regex pattern for contains that essentially acts as an AND operator for regex. You could simply have done another conjunction of .contains conditions for Basketball and Baseball.
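If you prefer the split-based idea from the question over a regex, a minimal sketch (continuing from the df read above and using the same column names) could turn each Gym cell into a set and test membership of both sports:
# hedged sketch: split the pipe-separated Gym column into sets, then require
# both sports plus the other conditions
sports = df['Gym'].str.split('|').apply(set)
has_both = sports.apply(lambda s: {'Baseball', 'Basketball'} <= s)

mask = (has_both
        & df['Region'].eq(4) & df['Type'].eq(1)
        & df['enroll'].between(240, 260))
print(df.loc[mask, 'price'])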