Multiplying a column of a file by an exponential function - Python

I'm struggling with multiplying one column of a file by an exponential function.
My equation is
y = 10.43^(-x/3.0678) + 0.654
The values in the column are my x in the equation. So far I have only been able to multiply by scalars, not by exponential functions.
The file looks like this:
8.09
5.7
5.1713
4.74
4.41
4.14
3.29
3.16
2.85
2.52
2.25
2.027
1.7
1.509
0.76
0.3
0.1
After the calculation, my y should take these values:
8.7 0.655294908
8.09 0.656064021
5.7 0.6668238549
5.1713 0.6732091509
4.74 0.6807096436
4.41 0.6883719253
4.14 0.6962497391
3.29 0.734902438
3.16 0.7433536016
2.85 0.7672424605
2.52 0.7997286905
2.25 0.8331287249
2.027 0.8664148415
1.7 0.926724933
1.509 0.9695896976
0.76 1.213417197
0.3 1.449100509
0.1 1.580418766
So far this code is working for me, but it's far from what I want:
import pandas as pd

col_list = ["Position"]
df = pd.read_csv("force.dat", usecols=col_list)
print(df)
A = df["Position"]
X = -A/3.0678 + 0.654  # linear only -- this is the part I can't turn into the exponential
print(X)

If I understand correctly, you just want to apply a function to a column of a pandas DataFrame, right? If so, you can define the function:
def foo(x):
    y = 10.43 ** (-x / 3.0678) + 0.654
    return y
and apply it to df in a new column. If "Position" is the column with the x values, then y will be
df['y'] = df['Position'].apply(foo)
Now print(df) should give you the example result in your question.

You can do it in one line:
>>> df['y'] = 10.43 ** (- df['x']/3.0678)+0.654
>>> print(df)
x y
0 8.0900 0.656064
1 5.7000 0.666824
2 5.1713 0.673209
3 4.7400 0.680710
4 4.4100 0.688372
5 4.1400 0.696250
6 3.2900 0.734902
7 3.1600 0.743354
8 2.8500 0.767242
9 2.5200 0.799729
10 2.2500 0.833129
11 2.0270 0.866415
12 1.7000 0.926725
13 1.5090 0.969590
14 0.7600 1.213417
15 0.3000 1.449101
16 0.1000 1.580419
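Putting the question and both answers together, a minimal self-contained sketch (the column name Position is taken from the question; the data is a short subset of the x values shown there):

```python
import pandas as pd

# A short subset of the x values from the question.
df = pd.DataFrame({"Position": [8.09, 5.7, 0.1]})

# Vectorized: 10.43 ** (-x / 3.0678) + 0.654 applied to the whole column at once.
df["y"] = 10.43 ** (-df["Position"] / 3.0678) + 0.654

print(df)
```

The arithmetic operators on a pandas Series broadcast elementwise, which is why no explicit loop or apply is needed here.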

Related

Encoding data with LabelEncoder()

I'm having the following dataset as a csv file.
Dataset ecoli.csv:
seq_name,mcg,gvh,lip,chg,aac,alm1,alm2,class
AAT_ECOLI,0.49,0.29,0.48,0.50,0.56,0.24,0.35,cp
ACEA_ECOLI,0.07,0.40,0.48,0.50,0.54,0.35,0.44,cp
(more entries...)
ACKA_ECOLI,0.59,0.49,0.48,0.50,0.52,0.45,0.36,cp
ADI_ECOLI,0.23,0.32,0.48,0.50,0.55,0.25,0.35,cp
My goal with this dataset is to apply some classification algorithms. To prepare the ecoli.csv file, I move the class column to the front and drop the seq_name column. Then I print a check for null values. Afterwards I plot a heatmap with the sns library.
Code before error:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing

column_drop = 'seq_name'
dataframe = pd.read_csv('ecoli.csv', header='infer')
dataframe.drop(column_drop, axis=1, inplace=True) # Dropping columns that I don't need
print(dataframe.isnull().sum())
plt.figure(figsize=(10,8))
sns.heatmap(dataframe.corr(), annot=True)
plt.show()
Before the encoding, I group the values of the dataset based on class. Finally I try to encode the dataset with LabelEncoder, but an error appears:
Error code:
result = dataframe.groupby(by=("class")).sum().reset_index()
print(result)
le = preprocessing.LabelEncoder()
dataframe.result = le.fit_transform(dataframe.result)
print(result)
Error:
AttributeError: 'DataFrame' object has no attribute 'result'
Update: result is filled with the following index
class mcg gvh lip chg aac alm1 alm2
0 cp 51.99 58.59 68.64 71.5 64.99 44.71 56.52
1 im 36.84 38.24 37.48 38.5 41.28 58.33 56.24
2 imL 1.45 0.94 2.00 1.5 0.91 1.29 1.14
3 imS 1.48 1.02 0.96 1.0 1.07 1.28 1.14
4 imU 25.41 16.06 17.32 17.5 19.56 26.04 26.18
5 om 13.45 14.20 10.12 10.0 14.78 9.25 6.11
6 omL 3.49 2.56 5.00 2.5 2.71 2.82 1.11
7 pp 33.91 36.39 24.96 26.0 22.71 24.34 19.47
Desired output:
Any thoughts?
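The question breaks off before its desired output, but the traceback itself points at the likely fix: dataframe.result looks up an attribute/column named result, which does not exist; the encoder should be applied to the class column instead (with sklearn that would be result['class'] = le.fit_transform(result['class'])). A minimal sketch of that idea, using the pandas equivalent .astype('category').cat.codes (which, like LabelEncoder, assigns integer codes in sorted label order) so the example stays self-contained:

```python
import pandas as pd

# Tiny stand-in for the grouped `result` frame in the question.
result = pd.DataFrame({
    "class": ["cp", "im", "imL", "om", "pp"],
    "mcg": [51.99, 36.84, 1.45, 13.45, 33.91],
})

# Encode the 'class' column, not a non-existent 'result' attribute.
result["class_code"] = result["class"].astype("category").cat.codes

print(result)
```

Here "cp" sorts first and "pp" last, so the codes come out 0 through 4 in that order.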

What is the best way to populate a column of a dataframe with conditional values based on corresponding rows in another column?

I have a dataframe, df, in which I am attempting to fill in values in the empty "Set" column, depending on a condition. The condition is as follows: the value of the 'Set' column needs to be "IN" whenever the 'valence_median_split' column's value is 'Low_Valence' in the corresponding row, and "OUT" in all other cases.
Please see below for an example of my attempt to solve this:
df.head()
Out[65]:
ID Category Num Vert_Horizon Description Fem_Valence_Mean \
0 Animals_001_h Animals 1 h Dead Stork 2.40
1 Animals_002_v Animals 2 v Lion 6.31
2 Animals_003_h Animals 3 h Snake 5.14
3 Animals_004_v Animals 4 v Wolf 4.55
4 Animals_005_h Animals 5 h Bat 5.29
Fem_Valence_SD Fem_Av/Ap_Mean Fem_Av/Ap_SD Arousal_Mean ... Contrast \
0 1.30 3.03 1.47 6.72 ... 68.45
1 2.19 5.96 2.24 6.69 ... 32.34
2 1.19 5.14 1.75 5.34 ... 59.92
3 1.87 4.82 2.27 6.84 ... 75.10
4 1.56 4.61 1.81 5.50 ... 59.77
JPEG_size80 LABL LABA LABB Entropy Classification \
0 263028 51.75 -0.39 16.93 7.86
1 250208 52.39 10.63 30.30 6.71
2 190887 55.45 0.25 4.41 7.83
3 282350 49.84 3.82 1.36 7.69
4 329325 54.26 -0.34 -0.95 7.82
valence_median_split temp_selection set
0 Low_Valence Animals_001_h
1 High_Valence NaN
2 Low_Valence Animals_003_h
3 Low_Valence Animals_004_v
4 Low_Valence Animals_005_h
[5 rows x 36 columns]
df['set'] = np.where(df.loc[df['valence_median_split'] == 'Low_Valence'], 'IN', 'OUT')
ValueError: Length of values does not match length of index
I can accomplish this by using loc to separate the df into two different df's, but I wonder whether there is a more elegant solution using np.where or a similar approach.
Change to
df['set'] = np.where(df['valence_median_split'] == 'Low_Valence', 'IN', 'OUT')
If need .loc
df.loc[df['valence_median_split'] == 'Low_Valence','set']='IN'
df.loc[df['valence_median_split'] != 'Low_Valence','set']='OUT'
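The difference between the failing and the working form can be seen on a toy frame (column names taken from the question): np.where needs the full-length boolean mask, whereas the original code passed it a .loc-filtered, shorter frame, hence the length mismatch.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "valence_median_split": ["Low_Valence", "High_Valence", "Low_Valence"],
})

# The condition is a boolean Series with one entry per row of df,
# so the result of np.where lines up with the index.
df["set"] = np.where(df["valence_median_split"] == "Low_Valence", "IN", "OUT")

print(df)
```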

What is the best way to merge a pandas.DataFrame with a pandas.Series based on df.columns and Series.index names?

I'm facing the following problem and I don't know the cleanest/smartest way to solve it.
I have a dataframe called wfm that contains the input for my simulation
wfm.head()
Out[51]:
OPN Vin Vout_ref Pout_ref CFN ... Cdclink Cdm L k ron
0 6 350 750 80500 1 ... 0.00012 0.00012 0.000131 -0.37 0.001
1 7 400 800 92000 1 ... 0.00012 0.00012 0.000131 -0.37 0.001
2 8 350 900 80500 1 ... 0.00012 0.00012 0.000131 -0.37 0.001
3 9 450 750 103500 1 ... 0.00012 0.00012 0.000131 -0.37 0.001
4 10 450 900 103500 1 ... 0.00012 0.00012 0.000131 -0.37 0.001
[5 rows x 13 columns]
then every simulation loop I receive 2 Series outputs_rms and outputs_avg that look like this:
outputs_rms outputs_avg
Out[53]: Out[54]:
time.rms 0.057751 time.avg 5.78E-02
Vi_dc.voltage.rms 400 Vi_dc.voltage.avg 4.00E+02
Vi_dc.current.rms 438.333188 Vi_dc.current.avg 3.81E+02
Vi_dc.power.rms 175333.2753 Vi_dc.power.avg 1.53E+05
Am_in.current.rms 438.333188 Am_in.current.avg 3.81E+02
Cdm.voltage.rms 396.614536 Cdm.voltage.avg 3.96E+02
Cdm.current.rms 0.213185 Cdm.current.avg -5.14E-05
motor_phU.current.rms 566.035833 motor_phU.current.avg -5.67E+02
motor_phU.voltage.rms 296.466083 motor_phU.voltage.avg -9.17E-02
motor_phV.current.rms 0.061024 motor_phV.current.avg 2.58E-02
motor_phV.voltage.rms 1.059341 motor_phV.voltage.avg -1.24E-09
motor_phW.current.rms 566.005071 motor_phW.current.avg 5.67E+02
motor_phW.voltage.rms 297.343876 motor_phW.voltage.avg 9.17E-02
S_ULS.voltage.rms 305.017804 S_ULS.voltage.avg 2.65E+02
S_ULS.current.rms 358.031053 S_ULS.current.avg -1.86E+02
S_UHS.voltage.rms 253.340047 S_UHS.voltage.avg 1.32E+02
S_UHS.current.rms 438.417985 S_UHS.current.avg 3.81E+02
S_VLS.voltage.rms 295.509073 S_VLS.voltage.avg 2.64E+02
S_VLS.current.rms 0 S_VLS.current.avg 0.00E+00
S_VHS.voltage.rms 152.727975 S_VHS.voltage.avg 1.32E+02
S_VHS.current.rms 0.061024 S_VHS.current.avg -2.58E-02
S_WLS.voltage.rms 509.388666 S_WLS.voltage.avg 2.64E+02
S_WLS.current.rms 438.417985 S_WLS.current.avg 3.81E+02
S_WHS.voltage.rms 619.258959 S_WHS.voltage.avg 5.37E+02
S_WHS.current.rms 357.982417 S_WHS.current.avg -1.86E+02
Cdclink.voltage.rms 801.958092 Cdclink.voltage.avg 8.02E+02
Cdclink.current.rms 103.73088 Cdclink.current.avg 2.08E-05
Am_out.current.rms 317.863371 Am_out.current.avg 1.86E+02
Vo_dc.voltage.rms 800 Vo_dc.voltage.avg 8.00E+02
Vo_dc.current.rms 317.863371 Vo_dc.current.avg -1.86E+02
Vo_dc.power.rms 254290.6969 Vo_dc.power.avg -1.49E+05
CFN 1 CFN 1.00E+00
OPN 6 OPN 6.00E+00
dtype: float64 dtype: float64
My goal is to place outputs_rms and outputs_avg on the right line of wfm, based on the 'CFN' and 'OPN' values.
What are your suggestions?
Thanks,
Riccardo
Suppose that you create these series as outputs output_rms_1, output_rms_2, etc.;
then the series can be combined into one DataFrame:
import pandas as pd
dfRms = pd.DataFrame([output_rms_1, output_rms_2, output_rms_3])
The next output, say output_rms_10, can then be added with pd.concat (DataFrame.append was removed in pandas 2.0):
dfRms = pd.concat([dfRms, output_rms_10.to_frame().T], ignore_index=True)
Finally, when all outputs are joined into one DataFrame,
you can merge the original wfm with the output, i.e.
result = pd.merge(wfm, dfRms, on=['CFN', 'OPN'], how='left')
Similarly for avg.
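A compact end-to-end sketch of that recipe with toy data (two output Series keyed by CFN and OPN, merged back onto a two-row wfm; the column names follow the question, the values are made up):

```python
import pandas as pd

wfm = pd.DataFrame({"OPN": [6, 7], "Vin": [350, 400], "CFN": [1, 1]})

# Two simulation outputs as Series; CFN/OPN identify the matching wfm row.
out1 = pd.Series({"time.rms": 0.057751, "CFN": 1, "OPN": 6})
out2 = pd.Series({"time.rms": 0.061000, "CFN": 1, "OPN": 7})

# Stack the Series into one DataFrame.
dfRms = pd.DataFrame([out1, out2])
# The Series held mixed values, so the key columns came out as floats;
# cast them back to int before merging on them.
dfRms[["CFN", "OPN"]] = dfRms[["CFN", "OPN"]].astype(int)

result = pd.merge(wfm, dfRms, on=["CFN", "OPN"], how="left")
print(result)
```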

TypeError: len() of unsized object when slicing a dataframe - different float formats?

I am trying to slice a dataframe based on some previously defined conditions defined in a separate array. When looping through that array to find the relevant slices of the dataframe, I run into a problem. The first iteration works fine, but the loop breaks during the second iteration, throwing TypeError: len() of unsized object.
Here is an example dataframe:
std sterr Z smooth
0 5.1 2.28 0 7.640484
1 5.13 2.29 0.1 7.532409
2 5.15 2.3 0.21 7.406423
3 5.17 2.31 0.31 7.267842
4 5.19 2.32 0.42 7.121988
5 5.21 2.33 0.52 6.974179
6 5.23 2.34 0.62 6.829734
7 5.25 2.35 0.73 6.693973
8 5.27 2.36 0.83 6.584009
9 5.29 2.37 0.94 6.49429
10 5.31 2.38 1.04 6.427032
Here is the code of the loop (Python 2, older pandas -- df.ix has since been removed):
turnz = df.ix[np.array(turn_iloc), 'Z']
c = 0.
print "turn points", np.array(turnz)
for i, zi in enumerate(np.array(turnz)):
    z0 = c
    print z0, zi, type(z0), type(zi)
    x = df.loc[((z0 <= df['Z']) & (df['Z'] <= zi)), 'Z']
    y = df.loc[((z0 <= df['Z']) & (df['Z'] <= zi)), 'smooth']
    print len(x), len(y)
    print type(x), type(y)
    c = zi
And these are the printed outputs:
turn points [ 1.04 2.19 2.5 4.06]
0.0 1.04 <type 'float'> <type 'numpy.float64'>
11 11
<class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'>
1.04 2.19 <type 'numpy.float64'> <type 'numpy.float64'>
after this, it throws the error.
However, if I try to slice the dataframe using these printed values outside the loop, it works fine.
print "IS IT",df.loc[((1.04<=df['Z'])& (df['Z']<=2.19)), 'Z']
prints
IS IT 10 1.04
11 1.14
12 1.25
13 1.35
14 1.46
15 1.56
16 1.67
17 1.77
18 1.87
19 1.98
20 2.08
21 2.19
Name: Z, dtype: float64
What am I missing?
The complete traceback is below, if it helps:
TypeError Traceback (most recent call last)
<ipython-input-18-b6d427f5dae7> in <module>()
9 z0 = c
10 print z0, zi, type(z0), type(zi)
---> 11 x = df.loc[((z0<=df['Z'])& (df['Z']<=zi)), 'Z']
12 y = df.loc[((z0<=df['Z'])& (df['Z']<=zi)), 'smooth']
13 print len(x), len(y)
C:\Users\me\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\ops.pyc in wrapper(self, other, axis)
739 return NotImplemented
740 elif isinstance(other, (np.ndarray, pd.Index)):
--> 741 if len(self) != len(other):
742 raise ValueError('Lengths must match to compare')
743 return self._constructor(na_op(self.values, np.asarray(other)),
TypeError: len() of unsized object
OUTCOME
It turns out my dataframe has trouble slicing with numpy floats; converting z0 and zi to float solves the problem!
The key to this problem lies in the printout of the first iteration:
0.0 1.04 <type 'float'> <type 'numpy.float64'>
The first iteration has a plain float as input and works. The remaining values, however, are elements of a numpy array and are typed as numpy floats, which is what the dataframe does not accept.
z0 = float(z0)
zi = float(zi)
does the trick.
Now, the question is... why?
If we print turnz.dtype and df['Z'] dtype, both are float64, so they seem to be the same. But python treats them differently, as explained in this answer
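The defensive conversion from the outcome above, in a small reproducible form. Note this is an assumption-laden sketch: modern pandas compares numpy floats against a Series without complaint, so the float() casts are shown here as the harmless safety net they were for the older version in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Z": [0.0, 0.5, 1.04, 1.5, 2.19, 3.0]})
turnz = np.array([1.04, 2.19])  # turn points, as in the question

c = 0.0
sizes = []
for zi in turnz:
    # Cast numpy floats to plain Python floats before comparing (the fix).
    z0, zi = float(c), float(zi)
    x = df.loc[(z0 <= df["Z"]) & (df["Z"] <= zi), "Z"]
    sizes.append(len(x))
    c = zi

print(sizes)
```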

Find shared sub-ranges defined by start and endpoints in pandas dataframe

I need to combine two dataframes that contain information about train track sections: while the "Line" identifies a track section, the two attributes "A" and "B" are given for subsections of the Line defined by start point and end point on the line; these subsections do not match between the two dataframes:
df1
Line startpoint endpoint Attribute_A
100 2.506 2.809 B-70
100 2.809 2.924 B-91
100 2.924 4.065 B-84
100 4.065 4.21 B-70
100 4.21 4.224 B-91
...
df2
Line startpoint endpoint Attribute_B
100 2.5 2.6 140
100 2.6 2.7 158
100 2.7 2.8 131
100 2.8 2.9 124
100 2.9 3.0 178
...
What I would need is a merged dataframe that gives me the combination of Attributes A and B for the respective minimal subsections where they are shared:
df3
Line startpoint endpoint Attribute_A Attribute_B
100 2.5 2.506 nan 140
100 2.506 2.6 B-70 140
100 2.6 2.7 B-70 158
100 2.7 2.8 B-70 131
100 2.8 2.809 B-70 124
100 2.809 2.9 B-91 124
100 2.9 2.924 B-91 178
100 2.924 3.0 B-84 178
...
How can I best do this in Python? I'm somewhat new to it, and while I get around basic calculations between rows and columns, I'm at my wit's end with this problem. The approach of merging and sorting the two dataframes and calculating the respective differences between start-/endpoints didn't get me very far, and I can't seem to find applicable information on the forums. I'm grateful for any hint!
Here is my solution, a bit long but it works:
First step is finding the intervals:
all_start_points = set(df1['startpoint'].values.tolist() + df2['startpoint'].values.tolist())
all_end_points = set(df1['endpoint'].values.tolist() + df2['endpoint'].values.tolist())
all_points = sorted(list(all_start_points.union(all_end_points)))
intervals = [(start, end) for start, end in zip(all_points[:-1], all_points[1:])]
Then we need to find the relevant interval in each dataframe (if present):
import numpy as np
def find_interval(df, interval):
    return df[(df['startpoint'] <= interval[0]) &
              (df['endpoint'] >= interval[1])]
attr_A = [find_interval(df1, intv)['Attribute_A'] for intv in intervals]
attr_A = [el.iloc[0] if len(el)>0 else np.nan for el in attr_A]
attr_B = [find_interval(df2, intv)['Attribute_B'] for intv in intervals]
attr_B = [el.iloc[0] if len(el)>0 else np.nan for el in attr_B]
Finally, we put everything together:
out = pd.DataFrame(intervals, columns = ['startpoint', 'endpoint'])
out = pd.concat([out, pd.Series(attr_A).to_frame('Attribute_A'), pd.Series(attr_B).to_frame('Attribute_B')], axis = 1)
out['Line'] = 100
And I get the expected result:
out
Out[111]:
startpoint endpoint Attribute_A Attribute_B Line
0 2.500 2.506 NaN 140.0 100
1 2.506 2.600 B-70 140.0 100
2 2.600 2.700 B-70 158.0 100
3 2.700 2.800 B-70 131.0 100
4 2.800 2.809 B-70 124.0 100
5 2.809 2.900 B-91 124.0 100
6 2.900 2.924 B-91 178.0 100
7 2.924 3.000 B-84 178.0 100
8 3.000 4.065 B-84 NaN 100
9 4.065 4.210 B-70 NaN 100
10 4.210 4.224 B-91 NaN 100
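The whole answer condenses into one runnable sketch (the data is the head of df1/df2 from the question, so only the first rows of the expected output are reproduced):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"startpoint": [2.506, 2.809, 2.924],
                    "endpoint":   [2.809, 2.924, 4.065],
                    "Attribute_A": ["B-70", "B-91", "B-84"]})
df2 = pd.DataFrame({"startpoint": [2.5, 2.6, 2.7, 2.8, 2.9],
                    "endpoint":   [2.6, 2.7, 2.8, 2.9, 3.0],
                    "Attribute_B": [140, 158, 131, 124, 178]})

# Every start/end point from either frame bounds a minimal subsection.
points = sorted(set(df1["startpoint"]) | set(df1["endpoint"])
                | set(df2["startpoint"]) | set(df2["endpoint"]))
intervals = list(zip(points[:-1], points[1:]))

def find_attr(df, col, intv):
    # The attribute of the subsection that fully contains this interval, if any.
    hit = df[(df["startpoint"] <= intv[0]) & (df["endpoint"] >= intv[1])]
    return hit[col].iloc[0] if len(hit) else np.nan

out = pd.DataFrame(intervals, columns=["startpoint", "endpoint"])
out["Attribute_A"] = [find_attr(df1, "Attribute_A", i) for i in intervals]
out["Attribute_B"] = [find_attr(df2, "Attribute_B", i) for i in intervals]
out["Line"] = 100
print(out)
```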
