Creating bins and extracting values for that bin - python

I have a pandas dataframe which looks like
   Temperature_lim  Factor
0               32    0.95
1               34    1.00
2               36    1.06
3               38    1.10
4               40    1.15
I need to extract the factor for any given temperature. If my current temperature is 31, the factor is 0.95; if it is 33, the factor is 1.00; if it is 38.5, the factor is 1.15. So, given my current temperature, I would like to know the factor for that temperature.
I can do this with multiple if/else statements, but is there a more efficient way to do it by creating bins/intervals in pandas or Python?
Thank you

Use cut, prepending -np.inf to the values of column Temperature_lim to form the bin edges, and replace the resulting missing values with the last value of Factor:
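import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Temp': [31, 33, 38.5, 40, 41]})
# bin edges: -inf followed by the temperature limits from the question's df
b = [-np.inf] + df['Temperature_lim'].tolist()
lab = df['Factor']
# right=False gives bins [low, high); temperatures >= 40 fall outside and get the last factor
df1['new'] = pd.cut(df1['Temp'], bins=b, labels=lab, right=False).fillna(lab.iat[-1])
print(df1)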
   Temp   new
0  31.0  0.95
1  33.0  1.00
2  38.5  1.15
3  40.0  1.15
4  41.0  1.15
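If a plain array of factors is enough (rather than a categorical column), the same lookup can probably be done with np.searchsorted. This is only a sketch of an alternative, not the method from the answer above; the helper name factor_for is made up for illustration, and the table is rebuilt from the question's data.

```python
import numpy as np
import pandas as pd

# Lookup table from the question.
df = pd.DataFrame({
    "Temperature_lim": [32, 34, 36, 38, 40],
    "Factor": [0.95, 1.00, 1.06, 1.10, 1.15],
})

def factor_for(temps, table=df):
    """Return the factor for each temperature (hypothetical helper)."""
    limits = table["Temperature_lim"].to_numpy()
    factors = table["Factor"].to_numpy()
    # side="right": a temperature equal to a limit moves to the next bin,
    # which mirrors pd.cut(..., right=False).
    idx = np.searchsorted(limits, temps, side="right")
    # Temperatures at or above the last limit reuse the last factor.
    idx = np.clip(idx, 0, len(factors) - 1)
    return factors[idx]

result = factor_for([31, 33, 38.5, 40, 41])
print(result)
```

This assumes Temperature_lim is sorted in ascending order, which np.searchsorted requires.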

Related

Take average of range entities and replace it in pandas column

I have dataframe where one column looks like
Average Weight (Kg)
0.647
0.88
0
0.73
1.7 - 2.1
1.2 - 1.5
2.5
NaN
1.5 - 1.9
1.3 - 1.5
0.4
1.7 - 2.9
Reproducible data
df = pd.DataFrame([0.647, 0.88, 0, 0.73, '1.7 - 2.1', '1.2 - 1.5', 2.5, np.nan, '1.5 - 1.9', '1.3 - 1.5', 0.4, '1.7 - 2.9'], columns=['Average Weight (Kg)'])
I would like to take the average of the range entries and replace them in the dataframe, e.g. 1.7 - 2.1 will be replaced by 1.9. The following code doesn't work; it raises TypeError: 'float' object is not iterable
np.where(df['Average Weight (Kg)'].str.contains('-'), df['Average Weight (Kg)'].str.split('-')
.apply(lambda x: statistics.mean((list(map(float, x)) ))), df['Average Weight (Kg)'])
Another possible solution, which is based on the following ideas:
Convert column to string.
Split each cell by \s-\s.
Explode column.
Convert back to float.
Group by and mean.
df['Average Weight (Kg)'] = (
    df['Average Weight (Kg)'].astype(str).str.split(r'\s-\s')
    .explode().astype(float).groupby(level=0).mean())
Output:
Average Weight (Kg)
0 0.647
1 0.880
2 0.000
3 0.730
4 1.900
5 1.350
6 2.500
7 NaN
8 1.700
9 1.400
10 0.400
11 2.300
edit: slight change to avoid creating a new column
You could go for something like this (renamed your column name to avg, cause it was long to type :-) ):
new_average =(df.avg.str.split('-').str[1].astype(float) + df.avg.str.split('-').str[0].astype(float) ) / 2
df["avg"] = new_average.fillna(df.avg)
yields for avg:
0 0.647
1 0.880
2 0.000
3 0.730
4 1.900
5 1.350
6 2.500
7 NaN
8 1.700
9 1.400
10 0.400
11 2.300
Name: avg, dtype: float64
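As for why the code in the question fails: .str.split returns NaN (a float) for the entries that are not strings, and the .apply(...) is evaluated for every row before np.where gets to choose between the two branches, so map(float, x) ends up being called on a float. A minimal sketch of a row-wise alternative that checks the type first; range_to_mean is a hypothetical helper name, and it assumes the values are non-negative so that '-' only ever appears as the range separator.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [0.647, 0.88, 0, 0.73, "1.7 - 2.1", "1.2 - 1.5", 2.5, np.nan,
     "1.5 - 1.9", "1.3 - 1.5", 0.4, "1.7 - 2.9"],
    columns=["Average Weight (Kg)"],
)

def range_to_mean(value):
    """Hypothetical helper: average 'a - b' strings, pass other values through."""
    if isinstance(value, str) and "-" in value:
        # float() tolerates the spaces left around each part after the split.
        parts = [float(p) for p in value.split("-")]
        return sum(parts) / len(parts)
    return value

cleaned = df["Average Weight (Kg)"].apply(range_to_mean).astype(float)
print(cleaned)
```

This is slower than the vectorised answers on large frames, but it makes the handling of non-string cells explicit.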

How to see distribution of values in a dataframe

I have a dataframe that loads a CSV. The csv is like this:
PROFIT STRING
16 A_B_C_D
3 A_D_C
-4 A_D_C
20 A_X_C
10 A_F_S
PROFIT is a float; STRING is a list of characters separated by underscores "_", so that A_B_C_D would be A, B, C and D individually.
I'm trying to see the profit distribution by character.
eg:
A:
Total profit = 16+3-4+20+10 = 45
Mean = xxx
Median = yyy
B:
Total profit = 16 (B appears only in A_B_C_D)
Mean = zzzz
etc...
Can this be done using pandas, and if so how?
Split and explode by column STRING, then do groupby + agg on column PROFIT
df.assign(STRING=df['STRING'].str.split('_'))\
.explode('STRING').groupby('STRING')['PROFIT'].agg(['sum', 'mean', 'median'])
        sum   mean  median
STRING
A        45   9.00    10.0
B        16  16.00    16.0
C        35   8.75     9.5
D        15   5.00     3.0
F        10  10.00    10.0
S        10  10.00    10.0
X        20  20.00    20.0
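For anyone who wants to run this end to end, here is a self-contained sketch that rebuilds the sample data from the question. The extra "count" aggregation is my own addition, to show how many rows each character appears in.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "PROFIT": [16.0, 3.0, -4.0, 20.0, 10.0],
    "STRING": ["A_B_C_D", "A_D_C", "A_D_C", "A_X_C", "A_F_S"],
})

stats = (
    df.assign(STRING=df["STRING"].str.split("_"))  # each cell becomes a list of characters
      .explode("STRING")                           # one row per (profit, character) pair
      .groupby("STRING")["PROFIT"]
      .agg(["count", "sum", "mean", "median"])
)
print(stats)
```

Note that a row's profit is counted once for every character in its STRING, so the per-character sums add up to more than the overall total.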

Doing Sum and Mean on different columns to generate a grand total in pandas

I have a dataframe, something like
name perc score
a 0.2 40
b 0.4 89
c 0.3 90
I want to have a total row where 'perc' has a mean aggregation and 'score' has a sum aggregation. The output should be like
name perc score
a 0.2 40
b 0.4 89
c 0.3 90
total 0.3 219
I want it as a dataframe output as I need to build plots using this. For now, I tried doing
df.loc['total'] = df.sum()
but this provides the sum for the percentage column as well, whereas I want an average for the percentage. How to do this in pandas?
Try this (with name set as the index, as in the output below):
df.loc['total'] = [df['perc'].mean(), df['score'].sum()]
Output:
       perc  score
name
a      0.20   40.0
b      0.40   89.0
c      0.30   90.0
total  0.30  219.0
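With more columns, listing the values by hand gets error-prone because the list has to follow the column order. A sketch of a variation that uses DataFrame.agg with a dict, so each column is paired with its aggregation by name; it assumes name has been set as the index, as in the output above.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "perc": [0.2, 0.4, 0.3],
    "score": [40, 89, 90],
}).set_index("name")

# One aggregation per column; the result is a Series indexed by column name.
totals = df.agg({"perc": "mean", "score": "sum"})

# Assigning a Series aligns on column names, so the column order does not matter.
df.loc["total"] = totals
print(df)
```

If name needs to be a regular column again for plotting, df.reset_index() afterwards should give that back.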

Add a column to the original pandas data frame after grouping by 2 columns and taking dot product of two other columns

I have the following data frame in pandas:
I want to add the Avg Price column in the original data frame, after grouping by (Date,Issuer) and then taking the dot product of weights and price, so that it is something like:
Is there a way to do it without using merge or join ? What would be the simplest way to do it?
One way using pandas.DataFrame.prod:
df["Avg Price"] = df[["Weights", "Price"]].prod(axis=1)
df["Avg Price"] = df.groupby(["Date", "Issuer"])["Avg Price"].transform("sum")
print(df)
Output:
         Date Issuer  Weights  Price  Avg Price
0  2019-11-12      A      0.4    100      120.0
1  2019-15-12      B      0.5    100      100.0
2  2019-11-12      A      0.2    200      120.0
3  2019-15-12      B      0.3    100      100.0
4  2019-11-12      A      0.4    100      120.0
5  2019-15-12      B      0.2    100      100.0
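The same result can probably be had in a single assignment, without the intermediate column, by grouping the row-wise product directly. A sketch, with the input frame rebuilt from the output shown above (the dates are kept as plain strings exactly as printed):

```python
import pandas as pd

# Input reconstructed from the printed output; dates stay as strings.
df = pd.DataFrame({
    "Date": ["2019-11-12", "2019-15-12"] * 3,
    "Issuer": ["A", "B"] * 3,
    "Weights": [0.4, 0.5, 0.2, 0.3, 0.4, 0.2],
    "Price": [100, 100, 200, 100, 100, 100],
})

# Row-wise product, then the per-group sum broadcast back to every row.
df["Avg Price"] = (
    (df["Weights"] * df["Price"])
    .groupby([df["Date"], df["Issuer"]])
    .transform("sum")
)
print(df)
```

transform returns a result aligned with the original rows, which is what makes the merge or join unnecessary.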

Rolling Linear Fit with Python DataFrame

I want to perform a moving window linear fit to the columns in my dataframe.
n =5
df = pd.DataFrame(index=pd.date_range('1/1/2000', periods=n))
df['B'] = [1.9,2.3,4.4,5.6,7.3]
df['A'] = [3.2,1.3,5.6,9.4,10.4]
              B     A
2000-01-01  1.9   3.2
2000-01-02  2.3   1.3
2000-01-03  4.4   5.6
2000-01-04  5.6   9.4
2000-01-05  7.3  10.4
For, say, column B, I want to perform a linear fit using the first two rows, then another linear fit using the second and third rows, and so on, and the same for column A. I am only interested in the slope of the fit, so at the end I want a new dataframe with the entries above replaced by the different rolling slopes.
After doing
df.reset_index()
I try something like
model = pd.ols(y=df['A'], x=df['index'], window_type='rolling',window=3)
But I get
KeyError: 'index'
EDIT:
I added a new column
df['i'] = range(0,len(df))
and I can now run
pd.ols(y=df['A'], x=df.i, window_type='rolling',window=3)
(it gives an error for window=2)
I am not understanding this well, because I was expecting a string of numbers but I get just one result
-------------------------Summary of Regression Analysis---------------
Formula: Y ~ <x> + <intercept>
Number of Observations: 3
Number of Degrees of Freedom: 2
R-squared: 0.8981
Adj R-squared: 0.7963
Rmse: 1.1431
F-stat (1, 1): 8.8163, p-value: 0.2068
Degrees of Freedom: model 1, resid 1
-----------------------Summary of Estimated Coefficients--------------
Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%
--------------------------------------------------------------------------------
x 2.4000 0.8083 2.97 0.2068 0.8158 3.9842
intercept 1.2667 2.5131 0.50 0.7028 -3.6590 6.1923
---------------------------------End of Summary---------------------------------
EDIT 2:
Now I understand better what is going on. I can access the different values of the fits using
model.beta
I haven't tried it out, but I don't think you need to specify window_type='rolling'; if you set window to something, window_type will automatically be set to rolling.
Source.
I have problems doing this with the DatetimeIndex you created with pd.date_range, and find datetimes a confusing pain to work with in general due to the number of types out there and apparent incompatibility between APIs. Here's how I would do it if the date were an integer (e.g. days since 12/31/99, or years) or float in your example. It won't help your datetime problem, but hopefully it helps with the rolling linear fit part.
Generating your date with integers instead of datetimes:
df = pd.DataFrame()
df['date'] = range(1,6)
df['B'] = [1.9,2.3,4.4,5.6,7.3]
df['A'] = [3.2,1.3,5.6,9.4,10.4]
   date    B     A
0     1  1.9   3.2
1     2  2.3   1.3
2     3  4.4   5.6
3     4  5.6   9.4
4     5  7.3  10.4
Since you want to group by 2 dates every time, then fit a linear model on each group, let's duplicate the records and number each group with the index:
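# sort_index() replaces DataFrame.sort(), which was removed from pandas;
# the names= argument is dropped because it only applies together with keys=
df_dbl = pd.concat([df, df]).sort_index()
df_dbl = df_dbl.iloc[1:-1] # removes the first and last row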
   date    B     A
0     1  1.9   3.2  # this record is removed
0     1  1.9   3.2
1     2  2.3   1.3
1     2  2.3   1.3
2     3  4.4   5.6
2     3  4.4   5.6
3     4  5.6   9.4
3     4  5.6   9.4
4     5  7.3  10.4
4     5  7.3  10.4  # this record is removed
c = df_dbl.index[1:len(df_dbl.index)].tolist()
c.append(max(df_dbl.index))
df_dbl.index = c
   date    B     A
1     1  1.9   3.2
1     2  2.3   1.3
2     2  2.3   1.3
2     3  4.4   5.6
3     3  4.4   5.6
3     4  5.6   9.4
4     4  5.6   9.4
4     5  7.3  10.4
Now it's ready to group by index to run linear models on B vs. date, which I learned from Using Pandas groupby to calculate many slopes. I use scipy.stats.linregress since I got weird results with pd.ols and couldn't find good documentation to understand why (perhaps because it's geared toward datetime).
1 0.4
2 2.1
3 1.2
4 1.7
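The fitting step itself is not shown above, and pd.ols has since been removed from pandas, so here is a minimal sketch of one way to get the rolling slopes for both columns. It uses np.polyfit rather than scipy.stats.linregress (my substitution, to keep the dependencies to numpy and pandas), and rolling_slopes is a hypothetical helper name. It slides a window over the rows directly instead of duplicating records.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": range(1, 6),
    "B": [1.9, 2.3, 4.4, 5.6, 7.3],
    "A": [3.2, 1.3, 5.6, 9.4, 10.4],
})

def rolling_slopes(frame, x_col, y_cols, window=2):
    """Hypothetical helper: slope of a degree-1 fit over each sliding window."""
    x = frame[x_col].to_numpy(dtype=float)
    out = {}
    for col in y_cols:
        y = frame[col].to_numpy(dtype=float)
        # np.polyfit returns [slope, intercept] for degree 1; keep the slope.
        out[col] = [
            np.polyfit(x[i - window:i], y[i - window:i], 1)[0]
            for i in range(window, len(frame) + 1)
        ]
    # Label each slope with the index of the last row in its window.
    return pd.DataFrame(out, index=frame.index[window - 1:])

slopes = rolling_slopes(df, "date", ["B", "A"])
print(slopes.round(2))
```

For column B with window=2 this gives the four slopes listed above. With window=3, the last slope for A comes out as 2.4, which matches the x coefficient in the regression summary printed in the question.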
