Encoding data with LabelEncoder() - python

I have the following dataset as a CSV file.
Dataset ecoli.csv:
seq_name,mcg,gvh,lip,chg,aac,alm1,alm2,class
AAT_ECOLI,0.49,0.29,0.48,0.50,0.56,0.24,0.35,cp
ACEA_ECOLI,0.07,0.40,0.48,0.50,0.54,0.35,0.44,cp
(more entries...)
ACKA_ECOLI,0.59,0.49,0.48,0.50,0.52,0.45,0.36,cp
ADI_ECOLI,0.23,0.32,0.48,0.50,0.55,0.25,0.35,cp
My goal is to apply some classification algorithms to this dataset. To prepare ecoli.csv, I'm trying to move the class column to the first position and drop the seq_name column. Then I print a check for null values, and afterwards I plot a correlation heatmap with the seaborn (sns) library.
Code before error:
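import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

column_drop = 'seq_name'
dataframe = pd.read_csv('ecoli.txt', header='infer')
dataframe.drop(column_drop, axis=1, inplace=True) # Dropping columns that I don't need
print(dataframe.isnull().sum()) # Count of null values per column
plt.figure(figsize=(10,8))
# On pandas >= 2.0, corr() raises on the string class column unless numeric_only=True is passed
sns.heatmap(dataframe.corr(), annot=True)
plt.show()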
Before the encoding step, I group the values of the dataset by class. Finally, I'm trying to encode the dataset with LabelEncoder, but an error appears:
Code that raises the error:
result = dataframe.groupby(by=("class")).sum().reset_index()
print(result)
le = preprocessing.LabelEncoder()
dataframe.result = le.fit_transform(dataframe.result)
print(result)
Error:
AttributeError: 'DataFrame' object has no attribute 'result'
Update: result contains the following:
class mcg gvh lip chg aac alm1 alm2
0 cp 51.99 58.59 68.64 71.5 64.99 44.71 56.52
1 im 36.84 38.24 37.48 38.5 41.28 58.33 56.24
2 imL 1.45 0.94 2.00 1.5 0.91 1.29 1.14
3 imS 1.48 1.02 0.96 1.0 1.07 1.28 1.14
4 imU 25.41 16.06 17.32 17.5 19.56 26.04 26.18
5 om 13.45 14.20 10.12 10.0 14.78 9.25 6.11
6 omL 3.49 2.56 5.00 2.5 2.71 2.82 1.11
7 pp 33.91 36.39 24.96 26.0 22.71 24.34 19.47
Desired output:
Any thoughts?
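The AttributeError arises because result is a separate DataFrame returned by groupby, not a column of dataframe, so dataframe.result does not exist. Below is a minimal sketch, assuming the goal is to turn the string labels in the class column into integers. The four-row frame is made-up sample data in the shape of ecoli.csv, and the column is addressed with square brackets because class is a reserved word in Python.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample in the shape of ecoli.csv (seq_name already dropped).
dataframe = pd.DataFrame({
    "mcg": [0.49, 0.07, 0.59, 0.23],
    "gvh": [0.29, 0.40, 0.49, 0.32],
    "class": ["cp", "im", "pp", "cp"],
})

le = LabelEncoder()
# Encode the column itself, addressed by name with square brackets.
dataframe["class"] = le.fit_transform(dataframe["class"])

print(dataframe)
print(list(le.classes_))  # original labels, in the order of their integer codes
```

If the grouped frame is what should be encoded, the same call would apply to result["class"] instead.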

Related

multiplying column of the file by exponential function

I'm struggling with multiplying one column of a file by an exponential function.
My equation is
y=10.43^(-x/3.0678)+0.654
The values in the column are the x in the equation. So far I have only been able to multiply by scalars, not apply an exponential function.
The file looks like this:
8.09
5.7
5.1713
4.74
4.41
4.14
3.29
3.16
2.85
2.52
2.25
2.027
1.7
1.509
0.76
0.3
0.1
So after the calculations, my Y should get these values
8.7 0.655294908
8.09 0.656064021
5.7 0.6668238549
5.1713 0.6732091509
4.74 0.6807096436
4.41 0.6883719253
4.14 0.6962497391
3.29 0.734902438
3.16 0.7433536016
2.85 0.7672424605
2.52 0.7997286905
2.25 0.8331287249
2.027 0.8664148415
1.7 0.926724933
1.509 0.9695896976
0.76 1.213417197
0.3 1.449100509
0.1 1.580418766
So far this code works for me, but it's far from what I want:
from scipy.optimize import minimize_scalar
import math
import pandas as pd
col_list = ["Position"]
df = pd.read_csv("force.dat", usecols=col_list)
print(df)
A = df["Position"]
X = (-A / 3.0678) + 0.654  # scalar operations only; the exponential is still missing
print(X)
If I understand it correctly you just want to apply a function to a column in a pandas dataframe, right? If so, you can define the function:
def foo(x):
    y = 10.43 ** (-x/3.0678)+0.654
    return y
and apply it to the column with the x values to create a new column. If A is the column with the x values, then y will be
df['y'] = df['A'].apply(foo)
Now print(df) should give you the example result in your question.
You can do it in one line:
>>> df['y'] = 10.43 ** (- df['x']/3.0678)+0.654
>>> print(df)
x y
0 8.0900 0.656064
1 5.7000 0.666824
2 5.1713 0.673209
3 4.7400 0.680710
4 4.4100 0.688372
5 4.1400 0.696250
6 3.2900 0.734902
7 3.1600 0.743354
8 2.8500 0.767242
9 2.5200 0.799729
10 2.2500 0.833129
11 2.0270 0.866415
12 1.7000 0.926725
13 1.5090 0.969590
14 0.7600 1.213417
15 0.3000 1.449101
16 0.1000 1.580419
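To tie the one-liner back to the original file, here is a self-contained sketch. It assumes force.dat has a single column with the header Position, as in the question's read_csv call. An in-memory string stands in for the file, and only three of the question's rows are used.

```python
import io
import pandas as pd

# Stand-in for force.dat: a single "Position" column (three rows from the question).
raw = io.StringIO("Position\n8.09\n5.7\n0.1\n")
df = pd.read_csv(raw, usecols=["Position"])

# Vectorized: the whole column is transformed at once, with no explicit loop.
df["y"] = 10.43 ** (-df["Position"] / 3.0678) + 0.654
print(df)
```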

What is the best way to populate a column of a dataframe with conditional values based on corresponding rows in another column?

I have a dataframe, df, in which I am attempting to fill in the empty 'set' column depending on a condition: the value of 'set' needs to be "IN" whenever the 'valence_median_split' value in the same row is 'Low_Valence', and "OUT" in all other cases.
Please see below for an example of my attempt to solve this:
df.head()
Out[65]:
ID Category Num Vert_Horizon Description Fem_Valence_Mean \
0 Animals_001_h Animals 1 h Dead Stork 2.40
1 Animals_002_v Animals 2 v Lion 6.31
2 Animals_003_h Animals 3 h Snake 5.14
3 Animals_004_v Animals 4 v Wolf 4.55
4 Animals_005_h Animals 5 h Bat 5.29
Fem_Valence_SD Fem_Av/Ap_Mean Fem_Av/Ap_SD Arousal_Mean ... Contrast \
0 1.30 3.03 1.47 6.72 ... 68.45
1 2.19 5.96 2.24 6.69 ... 32.34
2 1.19 5.14 1.75 5.34 ... 59.92
3 1.87 4.82 2.27 6.84 ... 75.10
4 1.56 4.61 1.81 5.50 ... 59.77
JPEG_size80 LABL LABA LABB Entropy Classification \
0 263028 51.75 -0.39 16.93 7.86
1 250208 52.39 10.63 30.30 6.71
2 190887 55.45 0.25 4.41 7.83
3 282350 49.84 3.82 1.36 7.69
4 329325 54.26 -0.34 -0.95 7.82
valence_median_split temp_selection set
0 Low_Valence Animals_001_h
1 High_Valence NaN
2 Low_Valence Animals_003_h
3 Low_Valence Animals_004_v
4 Low_Valence Animals_005_h
[5 rows x 36 columns]
df['set'] = np.where(df.loc[df['valence_median_split'] == 'Low_Valence'], 'IN', 'OUT')
ValueError: Length of values does not match length of index
I can accomplish this by using loc to split df into two separate dataframes, but I am wondering if there is a more elegant solution using np.where or a similar approach.
Change it to:
df['set'] = np.where(df['valence_median_split'] == 'Low_Valence', 'IN', 'OUT')
If you need .loc:
df.loc[df['valence_median_split'] == 'Low_Valence','set']='IN'
df.loc[df['valence_median_split'] != 'Low_Valence','set']='OUT'
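The reason for the ValueError is that df.loc[condition] returns only the matching rows, so np.where receives a shorter frame and its result no longer lines up with df. Passing the boolean Series itself keeps one result per row. A minimal sketch on a made-up three-row frame (the real one has 36 columns):

```python
import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for the frame in the question.
df = pd.DataFrame({"valence_median_split": ["Low_Valence", "High_Valence", "Low_Valence"]})

mask = df["valence_median_split"] == "Low_Valence"  # boolean Series, one entry per row
print(len(df.loc[mask]), len(df))                   # the filtered frame is shorter than df

# np.where on the mask itself returns exactly len(df) values.
df["set"] = np.where(mask, "IN", "OUT")
print(df)
```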

How to solve NaN values error using Lmfit with Python

I'm trying to fit a set of data produced by an external simulation, and stored in a vector, with the lmfit library.
My code is below:
import numpy as np
import matplotlib.pyplot as plt
from lmfit import Model
from lmfit import Parameters
def DGauss3Par(x,I1,sigma1,sigma2):
    I2 = 2.63 - I1
    return (I1/np.sqrt(2*np.pi*sigma1))*np.exp(-(x*x)/(2*sigma1*sigma1)) + (I2/np.sqrt(2*np.pi*sigma2))*np.exp(-(x*x)/(2*sigma2*sigma2))
#TAKE DATA
xFull = []
yFull = []
fileTypex = np.dtype([('xFull', float)])  # np.float was removed in NumPy 1.24; the builtin float is equivalent
fileTypey = np.dtype([('yFull', float)])
fDatax = "xValue.dat"
fDatay = "yValue.dat"
xFull = np.loadtxt(fDatax, dtype=fileTypex)
yFull = np.loadtxt(fDatay, dtype=fileTypey)
xGauss = xFull[:]["xFull"]
yGauss = yFull[:]["yFull"]
#MODEL'S DEFINITION
gmodel = Model(DGauss3Par)
params = Parameters()
params.add('I1', value=1.66)
params.add('sigma1', value=1.04)
params.add('sigma2', value=1.2)
result3 = gmodel.fit(yGauss, x=xGauss, params=params)
#PLOTS
plt.plot(xGauss, result3.best_fit, 'y-')
plt.show()
When I run it, I get this error:
File "Overlap.py", line 133, in <module>
result3 = gmodel.fit(yGauss, x=xGauss, params=params)
ValueError: The input contains nan values
These are the values in the vector xGauss (the x axis):
[-3.88 -3.28 -3.13 -3.08 -3.03 -2.98 -2.93 -2.88 -2.83 -2.78 -2.73 -2.68
-2.63 -2.58 -2.53 -2.48 -2.43 -2.38 -2.33 -2.28 -2.23 -2.18 -2.13 -2.08
-2.03 -1.98 -1.93 -1.88 -1.83 -1.78 -1.73 -1.68 -1.63 -1.58 -1.53 -1.48
-1.43 -1.38 -1.33 -1.28 -1.23 -1.18 -1.13 -1.08 -1.03 -0.98 -0.93 -0.88
-0.83 -0.78 -0.73 -0.68 -0.63 -0.58 -0.53 -0.48 -0.43 -0.38 -0.33 -0.28
-0.23 -0.18 -0.13 -0.08 -0.03 0.03 0.08 0.13 0.18 0.23 0.28 0.33
0.38 0.43 0.48 0.53 0.58 0.63 0.68 0.73 0.78 0.83 0.88 0.93
0.98 1.03 1.08 1.13 1.18 1.23 1.28 1.33 1.38 1.43 1.48 1.53
1.58 1.63 1.68 1.73 1.78 1.83 1.88 1.93 1.98 2.03 2.08 2.13
2.18 2.23 2.28 2.33 2.38 2.43 2.48 2.53 2.58 2.63 2.68 2.73
2.78 2.83 2.88 2.93 2.98 3.03 3.08 3.13 3.28 3.88]
And these are the values in the vector yGauss (the y axis):
[0.00173977 0.00986279 0.01529543 0.0242624 0.0287456 0.03238484
0.03285927 0.03945234 0.04615091 0.05701618 0.0637672 0.07194268
0.07763934 0.08565687 0.09615262 0.1043281 0.11350606 0.1199406
0.1260062 0.14093328 0.15079665 0.16651464 0.18065023 0.1938894
0.2047541 0.21794024 0.22806706 0.23793043 0.25164404 0.2635118
0.28075974 0.29568682 0.30871501 0.3311846 0.34648062 0.36984661
0.38540666 0.40618835 0.4283945 0.45002014 0.48303911 0.50746062
0.53167057 0.5548792 0.57835128 0.60256181 0.62566436 0.65704847
0.68289386 0.71332794 0.73258027 0.769608 0.78769989 0.81407275
0.83358852 0.85210239 0.87109068 0.89456217 0.91618782 0.93760247
0.95680234 0.96919757 0.9783219 0.98486193 0.9931429 0.9931429
0.98486193 0.9783219 0.96919757 0.95680234 0.93760247 0.91618782
0.89456217 0.87109068 0.85210239 0.83358852 0.81407275 0.78769989
0.769608 0.73258027 0.71332794 0.68289386 0.65704847 0.62566436
0.60256181 0.57835128 0.5548792 0.53167057 0.50746062 0.48303911
0.45002014 0.4283945 0.40618835 0.38540666 0.36984661 0.34648062
0.3311846 0.30871501 0.29568682 0.28075974 0.2635118 0.25164404
0.23793043 0.22806706 0.21794024 0.2047541 0.1938894 0.18065023
0.16651464 0.15079665 0.14093328 0.1260062 0.1199406 0.11350606
0.1043281 0.09615262 0.08565687 0.07763934 0.07194268 0.0637672
0.05701618 0.04615091 0.03945234 0.03285927 0.03238484 0.0287456
0.0242624 0.01529543 0.00986279 0.00173977]
I've also tried printing the values returned by my function, to see whether there really are any NaN values:
params = Parameters()
params.add('I1', value=1.66)
params.add('sigma1', value=1.04)
params.add('sigma2', value=1.2)
func = DGauss3Par(xGauss,I1,sigma1,sigma2)
print func
but what I obtained is:
[0.04835225 0.06938855 0.07735839 0.08040181 0.08366964 0.08718237
0.09096169 0.09503048 0.0994128 0.10413374 0.10921938 0.11469669
0.12059333 0.12693754 0.13375795 0.14108333 0.14894236 0.15736337
0.16637406 0.17600115 0.18627003 0.19720444 0.20882607 0.22115413
0.23420498 0.24799173 0.26252377 0.27780639 0.29384037 0.3106216
0.32814069 0.34638266 0.3653266 0.38494543 0.40520569 0.42606735
0.44748374 0.46940149 0.49176057 0.51449442 0.5375301 0.56078857
0.58418507 0.60762948 0.63102687 0.65427809 0.6772804 0.69992818
0.72211377 0.74372824 0.76466232 0.78480729 0.80405595 0.82230355
0.83944875 0.85539458 0.87004937 0.88332762 0.89515085 0.90544838
0.91415806 0.92122688 0.92661155 0.93027889 0.93220625 0.93220625
0.93027889 0.92661155 0.92122688 0.91415806 0.90544838 0.89515085
0.88332762 0.87004937 0.85539458 0.83944875 0.82230355 0.80405595
0.78480729 0.76466232 0.74372824 0.72211377 0.69992818 0.6772804
0.65427809 0.63102687 0.60762948 0.58418507 0.56078857 0.5375301
0.51449442 0.49176057 0.46940149 0.44748374 0.42606735 0.40520569
0.38494543 0.3653266 0.34638266 0.32814069 0.3106216 0.29384037
0.27780639 0.26252377 0.24799173 0.23420498 0.22115413 0.20882607
0.19720444 0.18627003 0.17600115 0.16637406 0.15736337 0.14894236
0.14108333 0.13375795 0.12693754 0.12059333 0.11469669 0.10921938
0.10413374 0.0994128 0.09503048 0.09096169 0.08718237 0.08366964
0.08040181 0.07735839 0.06938855 0.04835225]
So there don't seem to be any NaN values, and I don't understand why I get that error.
Could anyone help me, please? Thanks!
If you add a print call to your fit function that prints sigma1 and sigma2, you'll find that DGauss3Par has already been evaluated a few times before the error occurs. Both sigma variables have a negative value at the time the error occurs. Taking the square root of a negative value, of course, produces NaN.
You should add a min bound or similar to your sigma1 and sigma2 parameters to prevent this. Using min=0.0 as an additional argument to params.add(...) will result in a good fit.
Be aware that for some analyses, setting explicit bounds to your fitting parameters may make these analyses invalid. For most cases, you'll be fine, but for some cases, you'll need to check whether the fitting parameters should be allowed to vary from negative infinity to positive infinity, or are allowed to be bounded.
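The NaN mechanism can be reproduced without running a fit at all. The sketch below uses only NumPy and the model from the question. The negative width is a hypothetical trial value of the kind an unbounded fit might try, not a value taken from the actual run.

```python
import numpy as np

def DGauss3Par(x, I1, sigma1, sigma2):
    # Same model as in the question.
    I2 = 2.63 - I1
    return ((I1 / np.sqrt(2 * np.pi * sigma1)) * np.exp(-(x * x) / (2 * sigma1 * sigma1))
            + (I2 / np.sqrt(2 * np.pi * sigma2)) * np.exp(-(x * x) / (2 * sigma2 * sigma2)))

x = np.array([-1.0, 0.0, 1.0])

# Positive widths: every value is finite.
ok = DGauss3Par(x, 1.66, 1.04, 1.2)

# Negative sigma1 (hypothetical trial value): np.sqrt of a negative number is NaN,
# and the NaN spreads to the whole output array.
with np.errstate(invalid="ignore"):
    bad = DGauss3Par(x, 1.66, -0.5, 1.2)

print(np.isfinite(ok).all(), np.isnan(bad).all())
```

This is why the manual check with the initial values shows no NaN: the problem only appears once the fit moves a sigma below zero, which a lower bound such as params.add('sigma1', value=1.04, min=0.0) prevents.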

Error building a function to calculate standard deviation

I am new to Python and I am trying to build a function to run some statistics on a data set. The data is in Excel format and contains 7 rows, with the first row ... I know what a function is and how it should be built, but I can't figure out how to build this function.
This is the function:
def st_dev(benchmark, factor):
    benchmark = mkt_ret
    factor = smb
    statistics = st.stdev(benchmark, factor)
    return statistics
print(st_dev)
And this is the result:
Mkt-RF SMB HML RMW CMA RF
196307 -0.39 -0.46 -0.81 0.72 -1.16 0.27
196308 5.07 -0.81 1.65 0.42 -0.4 0.25
196309 -1.57 -0.48 0.19 -0.8 0.23 0.27
196310 2.53 -1.29 -0.09 2.75 -2.26 0.29
196311 -0.85 -0.85 1.71 -0.34 2.22 0.27
4.38
<function st_dev at 0x0000000002D92F28>
Process finished with exit code 0
the full code can be viewed here.
I tried writing the function several different ways; some error messages told me that I cannot convert 'Series' to numerator/denominator.
I am running Python 3.7.
Thank you for your help.
Alex
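Two things stand out in the snippet. First, print(st_dev) prints the function object itself, which is the <function st_dev at 0x...> line in the output, because the function is never called. Second, if st is the standard statistics module (an assumption, since the import is not shown), the second argument of stdev is an optional precomputed mean, not a second data series. A minimal sketch, assuming mkt_ret is a pandas Series and that the standard deviation of one series at a time is wanted:

```python
import pandas as pd

# Mkt-RF values copied from the printed table; the name mkt_ret follows the question.
mkt_ret = pd.Series([-0.39, 5.07, -1.57, 2.53, -0.85])

def st_dev(series):
    # Use the argument that was passed in rather than overwriting it.
    # Series.std() is the sample standard deviation (ddof=1), like statistics.stdev.
    return series.std()

# Call the function; print(st_dev) without parentheses prints the function object.
print(st_dev(mkt_ret))
```

The same function could then be called once for each column, for example st_dev(smb).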

Not performing calculation on blank field in dataframe

I have a data-frame (df) with the following structure:
date a b c d e f g
23/02/2009 577.9102
24/02/2009 579.1345
25/02/2009 583.2158
26/02/2009 629.7425
27/02/2009 553.8306
02/03/2009 6.15 5.31 5.80 223716 790.8724 5.7916
03/03/2009 6.16 6.2 6.26 818424 770.6165 6.0161
04/03/2009 6.6 6.485 6.57 636544 858.5754 1.4331 6.4149
05/03/2009 6.1 5.98 6.06 810584 816.5025 1.7475 6.242
06/03/2009 5.845 5.95 6.00 331079 796.7618 1.7144 6.0427
09/03/2009 5.4 5.2 5.28 504271 744.0833 1.6449 5.4076
10/03/2009 5.93 5.59 5.595 906742 814.2862 1.4128 5.8434
Where columns a and g have data, I would like to multiply them together using the following:
df["h"] = df["a"]*df["g"]
However, as you can see from the time series above, there is not always data with which to perform the calculation, and I get the following error:
KeyError: 'g'
Is there a way to check whether the data exists before performing the calculation? I am trying to use:
df["h"] = np.where((df.a == blank)|(df.g == blank),"",df.a*df.g)
I would like to have returned:
date a b c d e f g h
23/02/2009 577.9102
24/02/2009 579.1345
25/02/2009 583.2158
26/02/2009 629.7425
27/02/2009 553.8306
02/03/2009 6.15 5.31 5.8 223716 790.8724 5.7916 1.0618
03/03/2009 6.16 6.2 6.26 818424 770.6165 6.0161 1.0239
04/03/2009 6.6 6.485 6.57 636544 858.5754 1.4331 6.4149 1.0288
05/03/2009 6.1 5.98 6.06 810584 816.5025 1.7475 6.242 0.9772
06/03/2009 5.845 5.95 6.00 331079 796.7618 1.7144 6.0427 0.9672
09/03/2009 5.4 5.2 5.28 504271 744.0833 1.6449 5.4076 0.9985
10/03/2009 5.93 5.59 5.595 906742 814.2862 1.4128 5.8434 1.0148
but I am unsure of the syntax for a blank data field. What should that be?
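One possible approach, sketched under the assumption that the blanks arrive as empty strings (for example from a CSV read as text): convert the columns to numbers so blanks become NaN, then multiply, since arithmetic with NaN yields NaN and no explicit blank check is needed. The three-row frame is a made-up slice of the data. Note that the h values in the desired output look closer to a / g than to a * g (6.6 / 6.4149 ≈ 1.0288); the sketch keeps the multiplication as written in the question.

```python
import pandas as pd

# Hypothetical slice of the frame: blanks arrive as empty strings.
df = pd.DataFrame({
    "date": ["27/02/2009", "02/03/2009", "03/03/2009"],
    "a": ["", "6.15", "6.16"],
    "g": ["", "5.7916", "6.0161"],
})

# A KeyError: 'g' means there is no column with exactly that name;
# stray whitespace in the headers is one thing worth ruling out.
df.columns = df.columns.str.strip()

# Turn blanks (and anything else non-numeric) into NaN.
a = pd.to_numeric(df["a"], errors="coerce")
g = pd.to_numeric(df["g"], errors="coerce")

# Arithmetic with NaN yields NaN, so incomplete rows stay empty in h.
df["h"] = a * g
print(df)
```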
