Multivariate 'quadratic' regression in Python (like the fitlm function in MATLAB)

I wanted to ask if anyone can help me out.
I want to fit a 'quadratic' regression on 5 input variables in Python and obtain the resulting quadratic regression equation.
In MATLAB I can use the function
fitlm(ds,'quadratic')
ds is a nx5 array.
The output is (example):
Linear regression model:
x6 ~ [Linear formula with 21 terms in 5 predictors]
Estimated Coefficients:
Estimate SE tStat pValue
___________ __________ __________ __________
(Intercept) 3.8574 0.60766 6.348 2.296e-08
x1 0.2847 0.26311 1.0821 0.28316
x2 0.0022534 0.0046868 0.48079 0.63226
x3 -0.0092632 0.010228 -0.9057 0.36839
x4 -0.0039061 0.00043497 -8.9802 4.7159e-13
x5 0.0014984 0.00061604 2.4323 0.017722
x1:x2 -0.004019 0.0014052 -2.8602 0.0056639
x1:x3 -1.1981e-05 0.0021956 -0.0054568 0.99566
x1:x4 0.00011539 0.00011732 0.98356 0.32893
x1:x5 0.00011744 0.00017357 0.67661 0.50102
x2:x3 1.6354e-06 4.3911e-05 0.037243 0.9704
x2:x4 2.9589e-06 2.3464e-06 1.2611 0.21173
x2:x5 3.0621e-06 3.4713e-06 0.8821 0.38092
x3:x4 2.2725e-06 3.6662e-06 0.61986 0.53749
x3:x5 -1.4034e-05 5.7374e-06 -2.446 0.017117
x4:x5 2.5923e-06 2.8928e-07 8.9614 5.0922e-13
x1^2 -0.14307 0.052186 -2.7415 0.0078616
x2^2 -4.5755e-05 2.2194e-05 -2.0616 0.043186
x3^2 2.5903e-05 5.4432e-05 0.47587 0.63574
x4^2 1.1868e-06 1.4496e-07 8.1874 1.2233e-11
x5^2 -2.1103e-05 6.8098e-07 -30.989 4.7528e-41
How can I do the same thing in Python?
I tried linear_model.LinearRegression() and PolynomialFeatures() from sklearn, but so far I only get 5 terms back (the linear ones).
I attach some example values.
Columns 1-5 contain the parameters; column 6 contains the targets.
x1 x2 x3 x4 x5 x6
1.75 -2.5 76 1050 0 0.99
1 10 84 900 0 1.1598
1.5 10 84 900 100 1.2034
1.5 10 68 900 100 1.3544
1.5 10 84 900 200 0.8591
1.5 10 84 900 200 0.8595
1.25 -2.5 76 1050 100 1.072
1.25 22.5 76 750 200 1.0426
1 10 84 900 200 0.8588
1.25 -2.5 92 750 100 1.3811
1.25 22.5 92 1050 100 1.0213
2 10 84 900 0 1.0336
Thank you very much in advance!
Regards!
AF
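One possible way to mirror the 'quadratic' model shown above is to build the 21 columns explicitly (intercept, 5 linear terms, 10 pairwise interactions, 5 squares) and solve an ordinary least-squares problem. The sketch below uses only NumPy; the helper name quadratic_design is made up for illustration. sklearn's PolynomialFeatures(degree=2) generates the same set of columns, so if only five coefficients come back, the model was probably fitted on the raw inputs rather than on the transformed ones. Note that the 12 example rows are fewer than the 21 terms, so for this subset np.linalg.lstsq returns a minimum-norm solution and the numbers will not match the MATLAB table.

```python
from itertools import combinations

import numpy as np


def quadratic_design(X):
    """Build a full quadratic design matrix (hypothetical helper).

    Column order follows the fitlm table: intercept, linear terms,
    pairwise interactions, then squares.
    """
    X = np.asarray(X, dtype=float)
    n_rows, n_vars = X.shape
    names = ["(Intercept)"]
    cols = [np.ones(n_rows)]
    for i in range(n_vars):                      # linear terms x1..xk
        names.append(f"x{i + 1}")
        cols.append(X[:, i])
    for i, j in combinations(range(n_vars), 2):  # interactions xi:xj
        names.append(f"x{i + 1}:x{j + 1}")
        cols.append(X[:, i] * X[:, j])
    for i in range(n_vars):                      # squares xi^2
        names.append(f"x{i + 1}^2")
        cols.append(X[:, i] ** 2)
    return np.column_stack(cols), names


# Example rows from the question: x1..x5 are predictors, x6 is the target.
data = np.array([
    [1.75, -2.5, 76, 1050, 0, 0.99],
    [1, 10, 84, 900, 0, 1.1598],
    [1.5, 10, 84, 900, 100, 1.2034],
    [1.5, 10, 68, 900, 100, 1.3544],
    [1.5, 10, 84, 900, 200, 0.8591],
    [1.5, 10, 84, 900, 200, 0.8595],
    [1.25, -2.5, 76, 1050, 100, 1.072],
    [1.25, 22.5, 76, 750, 200, 1.0426],
    [1, 10, 84, 900, 200, 0.8588],
    [1.25, -2.5, 92, 750, 100, 1.3811],
    [1.25, 22.5, 92, 1050, 100, 1.0213],
    [2, 10, 84, 900, 0, 1.0336],
])
X, y = data[:, :5], data[:, 5]

A, names = quadratic_design(X)
# Least-squares fit; with fewer rows than columns this is the minimum-norm solution.
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

for name, value in zip(names, coef):
    print(f"{name:12s} {value: .6g}")
```

With the full dataset (more rows than terms) the same call gives the usual least-squares estimates; standard errors and p-values like those in the fitlm output would need a statistics package on top of this.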

Related

How do I forward-roll on a specific subset of data while modifying the original dataset?

I'm trying to perform this operation on this dataset. I'm trying to calculate the cumulative sum of a specific subset of the dataset, and I want the changes to be reflected in the original dataset.
The table below illustrates how I want to calculate Offset.
# OFFSET
from tqdm import tqdm

min_block = data.exit_block.min()  # avoid shadowing the built-ins min/max
max_block = data.exit_block.max()
data['Offset'] = 0  # create the column before assigning into it
for i in tqdm(range(min_block, min_block + 10)):
    # rows that entered before block i and have not exited before it
    active = (data["exit_block"] >= i) & (data["entry_block"] < i)
    entering = data["entry_block"] == i
    offset = data.loc[active, 'size'].sum()
    data.loc[entering, 'Offset'] = data.loc[entering, 'size'].cumsum() + offset
    print(active.sum())
    print(offset)
    print(data.loc[entering, 'size'].cumsum().head())
    print(data.loc[entering, 'size'].head())
    break  # debugging: stop after the first block
In the code above I'm creating a dataset B from the original dataset and trying to perform the cumulative sum operation on the original dataset using the values derived from dataset B.
Index Entry_block Exit_block Size Offset
1 10 20 10 10
2 11 20 150 160
3 18 20 100 260
4 19 21 40 300
5 20 21 120 120
6 20 21 180 300
7 20 21 210 510
8 20 21 90 600
9 20 21 450 1050

Python PuLP performance issue - taking too much time to solve

I am using PuLP to create an allocator function which packs items into trucks based on weight and volume. It works fine (takes 10-15 sec) for 10-15 items, but when I double the number of items it takes more than half an hour to solve.
import pulp
from pulp import LpInteger, LpMinimize, LpProblem, lpSum

def allocator(item_mass, item_vol, truck_mass, truck_vol, truck_cost, id_series):
    n_items = len(item_vol)
    set_items = range(n_items)
    n_trucks = len(truck_cost)
    set_trucks = range(n_trucks)
    print("working1")
    y = pulp.LpVariable.dicts('truckUsed', set_trucks,
                              lowBound=0, upBound=1, cat=LpInteger)
    x = pulp.LpVariable.dicts('itemInTruck', (set_items, set_trucks),
                              lowBound=0, upBound=1, cat=LpInteger)
    print("working2")
    # Model formulation
    prob = LpProblem("Truck allocation problem", LpMinimize)
    # Objective
    prob += lpSum([truck_cost[i] * y[i] for i in set_trucks])
    print("working3")
    # Constraints
    for j in set_items:
        # Every item must be taken in one truck
        prob += lpSum([x[j][i] for i in set_trucks]) == 1
    for i in set_trucks:
        # Respect the mass constraint of trucks
        prob += lpSum([item_mass[j] * x[j][i] for j in set_items]) <= truck_mass[i]*y[i]
        # Respect the volume constraint of trucks
        prob += lpSum([item_vol[j] * x[j][i] for j in set_items]) <= truck_vol[i]*y[i]
    print("working4")
    # Ensure y variables have to be set to make use of x variables;
    # a constraint only takes effect once it is added with "prob +=":
    for j in set_items:
        for i in set_trucks:
            prob += x[j][i] <= y[i]
    print("working5")
    s = id_series  # id_series
    prob.solve()
    print("working6")
This is the data i am running it on
items:
Name Pid Quantity Length Width Height Volume Weight t_type
0 A 1 1 4.60 4.30 4.3 85.05 1500 Open
1 B 2 1 4.60 4.30 4.3 85.05 1500 Open
2 C 3 1 6.00 5.60 9.0 302.40 10000 Container
3 D 4 1 8.75 5.60 6.6 441.00 1000 Open
4 E 5 1 6.00 5.16 6.6 204.33 3800 Open
5 C 6 1 6.00 5.60 9.0 302.40 10000 All
6 C 7 1 6.00 5.60 9.0 302.40 10000 Container
7 D 8 1 8.75 5.60 6.6 441.00 6000 Open
8 E 9 1 6.00 5.16 6.6 204.33 3800 Open
9 C 10 1 6.00 5.60 9.0 302.40 10000 All
.... times 5
trucks(this just the top 5 rows, I have 54 types of trucks in total):
Category Name TruckID Length(ft) Breadth(ft) Height(ft) Volume \
0 LCV Tempo 407 0 9.5 5.5 5.5 287.375
1 LCV Tempo 407 1 9.5 5.5 5.5 287.375
2 LCV Tempo 407 2 9.5 5.5 5.5 287.375
3 LCV 13 Feet 3 13.0 5.5 7.0 500.500
4 LCV 14 Feet 4 14.0 6.0 6.0 504.000
Weight Price
0 1500 1
1 2000 1
2 2500 2
3 3500 3
4 4000 3
where ItemId is this:
data["ItemId"] = data.index + 1
id_series = data["ItemId"].tolist()
PuLP can handle multiple solvers. See what ones you have with:
pulp.pulpTestAll()
This will give a list like:
Solver pulp.solvers.PULP_CBC_CMD unavailable.
Solver pulp.solvers.CPLEX_DLL unavailable.
Solver pulp.solvers.CPLEX_CMD unavailable.
Solver pulp.solvers.CPLEX_PY unavailable.
Testing zero subtraction
Testing continuous LP solution
Testing maximize continuous LP solution
...
* Solver pulp.solvers.COIN_CMD passed.
Solver pulp.solvers.COINMP_DLL unavailable.
Testing zero subtraction
Testing continuous LP solution
Testing maximize continuous LP solution
...
* Solver pulp.solvers.GLPK_CMD passed.
Solver pulp.solvers.XPRESS unavailable.
Solver pulp.solvers.GUROBI unavailable.
Solver pulp.solvers.GUROBI_CMD unavailable.
Solver pulp.solvers.PYGLPK unavailable.
Solver pulp.solvers.YAPOSIB unavailable.
You can then solve using, e.g.:
lp_prob.solve(pulp.COIN_CMD())
Gurobi and CPLEX are commercial solvers that tend to work quite well. Perhaps you could access them? Gurobi has a good academic license.
Alternatively, you may wish to look into an approximate solution, depending on your quality constraints.
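As one illustration of what an approximate solution could look like, here is a minimal greedy sketch (first-fit decreasing on volume, cheapest trucks first). It is not part of PuLP and not guaranteed to be optimal. The function name greedy_allocate and the toy numbers are assumptions for illustration, and the t_type column from the question is ignored.

```python
def greedy_allocate(item_mass, item_vol, truck_mass, truck_vol, truck_cost):
    """Greedy first-fit-decreasing sketch (hypothetical helper, not a PuLP API).

    Returns a dict mapping truck index -> list of item indices.
    Raises ValueError if some item fits in no truck.
    """
    # Place bulky items first; they are the hardest to fit.
    items = sorted(range(len(item_vol)), key=lambda j: item_vol[j], reverse=True)
    # Try cheap trucks first so the total cost tends to stay low.
    trucks = sorted(range(len(truck_cost)), key=lambda i: truck_cost[i])

    free_mass = list(truck_mass)  # remaining mass capacity per truck
    free_vol = list(truck_vol)    # remaining volume capacity per truck
    load = {}

    for j in items:
        # Prefer trucks already in use, then fall back to unused ones.
        in_use = [i for i in trucks if i in load]
        unused = [i for i in trucks if i not in load]
        for i in in_use + unused:
            if item_mass[j] <= free_mass[i] and item_vol[j] <= free_vol[i]:
                load.setdefault(i, []).append(j)
                free_mass[i] -= item_mass[j]
                free_vol[i] -= item_vol[j]
                break
        else:
            raise ValueError(f"item {j} does not fit in any truck")
    return load


# Toy data loosely based on the tables in the question (illustrative only).
item_mass = [1500, 1500, 1000]
item_vol = [85.05, 85.05, 441.0]
truck_mass = [1500, 2000, 3500]
truck_vol = [287.375, 287.375, 500.5]
truck_cost = [1, 1, 3]

plan = greedy_allocate(item_mass, item_vol, truck_mass, truck_vol, truck_cost)
total_cost = sum(truck_cost[i] for i in plan)
print(plan, total_cost)
```

A heuristic like this runs in a fraction of a second even for hundreds of items. Its result could also serve as an upper bound to compare against whatever the exact solver returns within a time limit.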

What is the best way to merge a pandas.DataFrame with a pandas.Series based on df.columns and Series.index names?

I'm facing the following problem and I don't know the cleanest/smartest way to solve it.
I have a dataframe called wfm that contains the input for my simulation:
wfm.head()
Out[51]:
OPN Vin Vout_ref Pout_ref CFN ... Cdclink Cdm L k ron
0 6 350 750 80500 1 ... 0.00012 0.00012 0.000131 -0.37 0.001
1 7 400 800 92000 1 ... 0.00012 0.00012 0.000131 -0.37 0.001
2 8 350 900 80500 1 ... 0.00012 0.00012 0.000131 -0.37 0.001
3 9 450 750 103500 1 ... 0.00012 0.00012 0.000131 -0.37 0.001
4 10 450 900 103500 1 ... 0.00012 0.00012 0.000131 -0.37 0.001
[5 rows x 13 columns]
Then, in every simulation loop, I receive 2 Series, outputs_rms and outputs_avg, that look like this:
outputs_rms outputs_avg
Out[53]: Out[54]:
time.rms 0.057751 time.avg 5.78E-02
Vi_dc.voltage.rms 400 Vi_dc.voltage.avg 4.00E+02
Vi_dc.current.rms 438.333188 Vi_dc.current.avg 3.81E+02
Vi_dc.power.rms 175333.2753 Vi_dc.power.avg 1.53E+05
Am_in.current.rms 438.333188 Am_in.current.avg 3.81E+02
Cdm.voltage.rms 396.614536 Cdm.voltage.avg 3.96E+02
Cdm.current.rms 0.213185 Cdm.current.avg -5.14E-05
motor_phU.current.rms 566.035833 motor_phU.current.avg -5.67E+02
motor_phU.voltage.rms 296.466083 motor_phU.voltage.avg -9.17E-02
motor_phV.current.rms 0.061024 motor_phV.current.avg 2.58E-02
motor_phV.voltage.rms 1.059341 motor_phV.voltage.avg -1.24E-09
motor_phW.current.rms 566.005071 motor_phW.current.avg 5.67E+02
motor_phW.voltage.rms 297.343876 motor_phW.voltage.avg 9.17E-02
S_ULS.voltage.rms 305.017804 S_ULS.voltage.avg 2.65E+02
S_ULS.current.rms 358.031053 S_ULS.current.avg -1.86E+02
S_UHS.voltage.rms 253.340047 S_UHS.voltage.avg 1.32E+02
S_UHS.current.rms 438.417985 S_UHS.current.avg 3.81E+02
S_VLS.voltage.rms 295.509073 S_VLS.voltage.avg 2.64E+02
S_VLS.current.rms 0 S_VLS.current.avg 0.00E+00
S_VHS.voltage.rms 152.727975 S_VHS.voltage.avg 1.32E+02
S_VHS.current.rms 0.061024 S_VHS.current.avg -2.58E-02
S_WLS.voltage.rms 509.388666 S_WLS.voltage.avg 2.64E+02
S_WLS.current.rms 438.417985 S_WLS.current.avg 3.81E+02
S_WHS.voltage.rms 619.258959 S_WHS.voltage.avg 5.37E+02
S_WHS.current.rms 357.982417 S_WHS.current.avg -1.86E+02
Cdclink.voltage.rms 801.958092 Cdclink.voltage.avg 8.02E+02
Cdclink.current.rms 103.73088 Cdclink.current.avg 2.08E-05
Am_out.current.rms 317.863371 Am_out.current.avg 1.86E+02
Vo_dc.voltage.rms 800 Vo_dc.voltage.avg 8.00E+02
Vo_dc.current.rms 317.863371 Vo_dc.current.avg -1.86E+02
Vo_dc.power.rms 254290.6969 Vo_dc.power.avg -1.49E+05
CFN 1 CFN 1.00E+00
OPN 6 OPN 6.00E+00
dtype: float64 dtype: float64
My goal is then to place outputs_rms and outputs_avg on the correct row of wfm, based on the 'CFN' and 'OPN' values.
What are your suggestions?
Thanks,
Riccardo
Suppose that you create these series as outputs output_rms_1, output_rms_2, etc.;
then the series can be combined into one dataframe:
import pandas as pd
dfRms = pd.DataFrame([output_rms_1, output_rms_2, output_rms_3])
The next output, say output_rms_10, can simply be added as a new row (DataFrame.append was removed in pandas 2.0, so pd.concat is used here):
dfRms = pd.concat([dfRms, output_rms_10.to_frame().T], ignore_index=True)
Finally, when all outputs are joined into one dataframe,
you can merge the original wfm with the output, i.e.
result = pd.merge(wfm, dfRms, on=['CFN', 'OPN'], how='left')
Similarly for avg.
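A minimal end-to-end sketch of this approach, assuming the per-loop Series are collected in a plain list and turned into a dataframe once at the end (growing a dataframe row by row is usually slower). The toy wfm columns and values below are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the real inputs (values are illustrative only).
wfm = pd.DataFrame({"OPN": [6, 7, 8], "CFN": [1, 1, 1], "Vin": [350, 400, 350]})

rms_rows = []  # one Series per simulation loop
for opn, voltage in [(6, 400.0), (7, 410.0)]:
    outputs_rms = pd.Series(
        {"Vi_dc.voltage.rms": voltage, "CFN": 1.0, "OPN": float(opn)}
    )
    rms_rows.append(outputs_rms)

# Build the frame once; each Series becomes one row.
df_rms = pd.DataFrame(rms_rows)
# The Series are float64, so cast the key columns back to the dtype used in wfm.
df_rms[["CFN", "OPN"]] = df_rms[["CFN", "OPN"]].astype(int)

# Left merge keeps every row of wfm; rows without a result get NaN.
result = wfm.merge(df_rms, on=["CFN", "OPN"], how="left")
print(result)
```

Operating points that have not been simulated yet (OPN 8 in this toy data) simply keep NaN in the output columns.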

Compare 2 dataframes with pandas

This is the first time I have used pandas and I do not really know how to approach my problem.
I have 2 dataframes:
import pandas
blast=pandas.read_table("blast")
cluster=pandas.read_table("cluster")
Here is an example of their contents:
>>> cluster
cluster_name seq_names
0 1 g1.t1_0035
1 1 g1.t1_0035_0042
2 119365 g1.t1_0042
3 90273 g1.t1_0042_0035
4 71567 g10.t1_0035
5 37976 g10.t1_0035_0042
6 22560 g10.t1_0042
7 90280 g10.t1_0042_0035
8 82698 g100.t1_0035
9 47392 g100.t1_0035_0042
10 28484 g100.t1_0042
11 22580 g100.t1_0042_0035
12 19474 g1000.t1_0035
13 5770 g1000.t1_0035_0042
14 29708 g1000.t1_0042
15 99776 g1000.t1_0042_0035
16 6283 g10000.t1_0035
17 39828 g10000.t1_0035_0042
18 25383 g10000.t1_0042
19 106614 g10000.t1_0042_0035
20 6285 g10001.t1_0035
21 13866 g10001.t1_0035_0042
22 121157 g10001.t1_0042
23 106615 g10001.t1_0042_0035
24 6286 g10002.t1_0035
25 113 g10002.t1_0035_0042
26 25397 g10002.t1_0042
27 106616 g10002.t1_0042_0035
28 4643 g10003.t1_0035
29 13868 g10003.t1_0035_0042
... ... ...
[78793 rows x 2 columns]
and
>>> blast
qseqid sseqid pident length mismatch \
0 g1.t1_0035_0042 g1.t1_0035_0042 100.0 286 0
1 g1.t1_0035_0042 g1.t1_0035 100.0 257 0
2 g1.t1_0035_0042 g9307.t1_0035 26.9 134 65
3 g2.t1_0035_0042 g2.t1_0035_0042 100.0 445 0
4 g2.t1_0035_0042 g2.t1_0035 95.8 451 3
5 g2.t1_0035_0042 g24520.t1_0042_0035 61.1 429 137
6 g2.t1_0035_0042 g9924.t1_0042 61.1 429 137
7 g2.t1_0035_0042 g1838.t1_0035 86.2 29 4
8 g3.t1_0035_0042 g3.t1_0035_0042 100.0 719 0
9 g3.t1_0035_0042 g3.t1_0035 84.7 753 62
10 g4.t1_0035_0042 g4.t1_0035_0042 100.0 242 0
11 g4.t1_0035_0042 g3.t1_0035 98.8 161 2
12 g5.t1_0035_0042 g5.t1_0035_0042 100.0 291 0
13 g5.t1_0035_0042 g3.t1_0035 93.1 291 0
14 g6.t1_0035_0042 g6.t1_0035_0042 100.0 152 0
15 g6.t1_0035_0042 g4.t1_0035 100.0 152 0
16 g7.t1_0035_0042 g7.t1_0035_0042 100.0 216 0
17 g7.t1_0035_0042 g5.t1_0035 98.1 160 3
18 g7.t1_0035_0042 g11143.t1_0042 46.5 230 99
19 g7.t1_0035_0042 g27537.t1_0042_0035 40.8 233 111
20 g3778.t1_0035_0042 g3778.t1_0035_0042 100.0 86 0
21 g3778.t1_0035_0042 g6174.t1_0035 98.0 51 1
22 g3778.t1_0035_0042 g20037.t1_0035_0042 100.0 50 0
23 g3778.t1_0035_0042 g37190.t1_0035 100.0 50 0
24 g3778.t1_0035_0042 g15112.t1_0042_0035 66.0 53 18
25 g3778.t1_0035_0042 g6061.t1_0042 66.0 53 18
26 g18109.t1_0035_0042 g18109.t1_0035_0042 100.0 86 0
27 g18109.t1_0035_0042 g33071.t1_0035 100.0 81 0
28 g18109.t1_0035_0042 g32810.t1_0035 96.4 83 3
29 g18109.t1_0035_0042 g17982.t1_0035_0042 98.6 72 1
... ... ... ... ... ...
If you focus on the cluster dataframe, the first column corresponds to the cluster ID, and each cluster contains several sequence IDs.
What I need to do first is split all my clusters (in R it would be like: liste=split(x = data$V2, f = data$V1) )
and then create a function which displays the most similar pair of sequences within each cluster.
Here is an example:
let's say I have two clusters (dataframe cluster):
cluster 1:
seq1
seq2
seq3
seq4
cluster 2:
seq5
seq6
seq7
...
In the blast dataframe, the 3rd column holds the similarity between all sequences (all against all), so something like:
seq1 vs seq1 100
seq1 vs seq2 90
seq1 vs seq3 56
seq1 vs seq4 49
seq1 vs seq5 40
....
seq2 vs seq3 70
seq2 vs seq4 98
...
seq5 vs seq5 100
seq5 vs seq6 89
seq5 vs seq7 60
seq7 vs seq7 46
seq7 vs seq7 100
seq6 vs seq6 100
and what I need to get is :
cluster 1 (best paired sequences):
seq 1 vs seq 2
cluster2 (best paired sequences):
seq 5 vs seq6
...
So as you can see, I do not want to take into account sequences paired with themselves.
If someone could give me some clues it would be fantastic.
Thank you all.
Firstly, I assume that there are no pairings in 'blast' between sequences from two different clusters. In other words, in this solution the cluster ID of a pairing is determined by only one of the two sequence IDs.
Combine the cluster information and the pairing information into one dataframe:
data = cluster.merge(blast, left_on='seq_names', right_on='qseqid')
Then the data should only contain pairings of different sequences:
data = data[data['qseqid']!=data['sseqid']]
To ignore pairings which have the same substrings in their seqid, the most readable way would be to add data columns with these data:
data['qspec'] = [seqid.split('_')[1] for seqid in data['qseqid'].values]
data['sspec'] = [seqid.split('_')[1] for seqid in data['sseqid'].values]
Now rows with equal spec values can be filtered out the same way as was done with equal seqids above:
data = data[data['qspec']!=data['sspec']]
In the end the data should be grouped by cluster-ID and within each group, the maximum of pident is of interest:
data_grpd = data.groupby('cluster_name')
result = data.loc[data_grpd['pident'].idxmax()]
The only drawback here, apart from the assumption mentioned above, is that if several pairings share exactly the same maximum value, only one of them is returned.
Note: if you don't want the spec columns to be of type string, you can easily turn them into integers on the fly with the built-in int (np.int was removed in NumPy 1.24):
data['qspec'] = [int(seqid.split('_')[1]) for seqid in data['qseqid'].values]
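The merge, filter, and groupby steps of this answer can be tried on a few made-up rows. The sketch below leaves out the optional spec-column filter and also shows one possible variant, based on transform("max"), that keeps every pairing tied for the maximum instead of just the first one. The toy IDs and pident values are invented for illustration.

```python
import pandas as pd

# Toy data in the same layout as the question (values are made up).
cluster = pd.DataFrame({
    "cluster_name": [1, 1, 2, 2],
    "seq_names": ["g1.t1_0035_0042", "g1.t1_0035",
                  "g2.t1_0035_0042", "g2.t1_0042"],
})
blast = pd.DataFrame({
    "qseqid": ["g1.t1_0035_0042", "g1.t1_0035_0042",
               "g2.t1_0035_0042", "g2.t1_0035_0042", "g2.t1_0035_0042"],
    "sseqid": ["g1.t1_0035_0042", "g1.t1_0035",
               "g2.t1_0035_0042", "g2.t1_0042", "g9.t1_0042"],
    "pident": [100.0, 95.0, 100.0, 61.1, 61.1],
})

# Attach the cluster ID to each pairing via the query sequence.
data = cluster.merge(blast, left_on="seq_names", right_on="qseqid")
# Drop sequences paired with themselves.
data = data[data["qseqid"] != data["sseqid"]]

# One best pairing per cluster (the first one wins on ties).
best = data.loc[data.groupby("cluster_name")["pident"].idxmax()]

# Variant: keep every pairing that reaches the cluster maximum.
cluster_max = data.groupby("cluster_name")["pident"].transform("max")
best_with_ties = data[data["pident"] == cluster_max]

print(best[["cluster_name", "qseqid", "sseqid", "pident"]])
print(best_with_ties[["cluster_name", "qseqid", "sseqid", "pident"]])
```

In this toy data cluster 2 has two pairings with the same pident, so best holds one row per cluster while best_with_ties holds both tied rows.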
This merges the dataframes first on sseqid, then on qseqid, and combines both results into results_df. Any pairs with a 100% match are filtered out. Let me know if this works. You can then order by cluster name.
blast = blast.loc[blast['pident'] != 100]
results_df = cluster.merge(blast, left_on='seq_names',right_on='sseqid')
results_df = pandas.concat([results_df, cluster.merge(blast, left_on='seq_names',right_on='qseqid')])

Find shared sub-ranges defined by start and endpoints in pandas dataframe

I need to combine two dataframes that contain information about train track sections: while the "Line" identifies a track section, the two attributes "A" and "B" are given for subsections of the Line defined by start point and end point on the line; these subsections do not match between the two dataframes:
df1
Line startpoint endpoint Attribute_A
100 2.506 2.809 B-70
100 2.809 2.924 B-91
100 2.924 4.065 B-84
100 4.065 4.21 B-70
100 4.21 4.224 B-91
...
df2
Line startpoint endpoint Attribute_B
100 2.5 2.6 140
100 2.6 2.7 158
100 2.7 2.8 131
100 2.8 2.9 124
100 2.9 3.0 178
...
What I would need is a merged dataframe that gives me the combination of Attributes A and B for the respective minimal subsections where they are shared:
df3
Line startpoint endpoint Attribute_A Attribute_B
100 2.5 2.506 nan 140
100 2.506 2.6 B-70 140
100 2.6 2.7 B-70 158
100 2.7 2.8 B-70 131
100 2.8 2.809 B-70 124
100 2.809 2.9 B-91 124
100 2.9 2.924 B-91 178
100 2.924 3.0 B-84 178
...
How can I best do this in Python? I'm somewhat new to it, and while I can get around basic calculations between rows and columns, I'm at my wits' end with this problem; the approach of merging and sorting the two dataframes and calculating the respective differences between start- / endpoints didn't get me very far, and I can't seem to find applicable information on the forums. I'm grateful for any hint!
Here is my solution, a bit long but it works:
First step is finding the intervals:
all_start_points = set(df1['startpoint'].values.tolist() + df2['startpoint'].values.tolist())
all_end_points = set(df1['endpoint'].values.tolist() + df2['endpoint'].values.tolist())
all_points = sorted(list(all_start_points.union(all_end_points)))
intervals = [(start, end) for start, end in zip(all_points[:-1], all_points[1:])]
Then we need to find the relevant interval in each dataframe (if present):
import numpy as np
import pandas as pd
def find_interval(df, interval):
    # rows whose sub-range fully covers the given interval
    return df[(df['startpoint']<=interval[0]) &
              (df['endpoint']>=interval[1])]
attr_A = [find_interval(df1, intv)['Attribute_A'] for intv in intervals]
attr_A = [el.iloc[0] if len(el)>0 else np.nan for el in attr_A]
attr_B = [find_interval(df2, intv)['Attribute_B'] for intv in intervals]
attr_B = [el.iloc[0] if len(el)>0 else np.nan for el in attr_B]
Finally, we put everything together:
out = pd.DataFrame(intervals, columns = ['startpoint', 'endpoint'])
out = pd.concat([out, pd.Series(attr_A).to_frame('Attribute_A'), pd.Series(attr_B).to_frame('Attribute_B')], axis = 1)
out['Line'] = 100
And I get the expected result:
out
Out[111]:
startpoint endpoint Attribute_A Attribute_B Line
0 2.500 2.506 NaN 140.0 100
1 2.506 2.600 B-70 140.0 100
2 2.600 2.700 B-70 158.0 100
3 2.700 2.800 B-70 131.0 100
4 2.800 2.809 B-70 124.0 100
5 2.809 2.900 B-91 124.0 100
6 2.900 2.924 B-91 178.0 100
7 2.924 3.000 B-84 178.0 100
8 3.000 4.065 B-84 NaN 100
9 4.065 4.210 B-70 NaN 100
10 4.210 4.224 B-91 NaN 100
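The solution above hard-codes Line 100. If the dataframes contain several lines, one possible extension is to apply the same interval logic once per Line value and concatenate the pieces. The helper names split_line and merge_subranges and the second line (200) in the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def split_line(sub1, sub2):
    """Merge the sub-ranges of a single Line (hypothetical helper)."""
    points = sorted(set(sub1["startpoint"]) | set(sub1["endpoint"])
                    | set(sub2["startpoint"]) | set(sub2["endpoint"]))
    rows = []
    for start, end in zip(points[:-1], points[1:]):
        # Rows whose sub-range fully covers the minimal interval [start, end].
        a = sub1[(sub1["startpoint"] <= start) & (sub1["endpoint"] >= end)]
        b = sub2[(sub2["startpoint"] <= start) & (sub2["endpoint"] >= end)]
        rows.append({
            "startpoint": start,
            "endpoint": end,
            "Attribute_A": a["Attribute_A"].iloc[0] if len(a) else np.nan,
            "Attribute_B": b["Attribute_B"].iloc[0] if len(b) else np.nan,
        })
    return pd.DataFrame(rows)


def merge_subranges(df1, df2):
    """Apply split_line to every Line found in either dataframe."""
    parts = []
    for line in sorted(set(df1["Line"]) | set(df2["Line"])):
        part = split_line(df1[df1["Line"] == line], df2[df2["Line"] == line])
        part.insert(0, "Line", line)
        parts.append(part)
    return pd.concat(parts, ignore_index=True)


# Toy data: the first rows of the question plus a made-up Line 200.
df1 = pd.DataFrame({
    "Line": [100, 100, 200],
    "startpoint": [2.506, 2.809, 0.0],
    "endpoint": [2.809, 2.924, 1.0],
    "Attribute_A": ["B-70", "B-91", "B-84"],
})
df2 = pd.DataFrame({
    "Line": [100, 100, 200],
    "startpoint": [2.5, 2.6, 0.5],
    "endpoint": [2.6, 2.7, 1.5],
    "Attribute_B": [140, 158, 99],
})

out = merge_subranges(df1, df2)
print(out)
```

Intervals covered by only one of the dataframes keep NaN in the other attribute, as in the output above. A gap covered by neither dataframe would show up as a row with NaN in both columns.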
