How to incrementally add linear regression column to pandas dataframe

I got an example pandas dataframe like this:
a b
0 6.0 0.6
1 1.0 0.3
2 3.0 0.8
3 5.0 0.1
4 7.0 0.4
5 2.0 0.2
6 0.0 0.9
7 4.0 0.7
8 8.0 0.0
9 9.0 0.5
I want to add a new column, linear, to the dataframe, holding the output of a linear regression of b on a. So far I have:
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
# as_matrix() was removed in pandas 1.0; to_numpy() is the current equivalent
X = df['a'].to_numpy().reshape(-1, 1)
reg.fit(X, df['b'].to_numpy().reshape(-1, 1))
reg.predict(X)  # This will give the linear regression outcome for the whole column
Now I want to do the linear regression incrementally along series a, so the first entry of linear will be b[0], the second will be b[0]/a[0]*a[1], the third will be the linear regression outcome of the first two entries, and so on. I have no clue how to do that with pandas except by iterating through all the entries. Is there a better way?

You can use expanding with some custom apply functions. Interesting way to do LR...
from io import StringIO
import pandas as pd
import numpy as np
df = pd.read_table(StringIO(""" a b
0 6.0 0.6
1 1.0 0.3
2 3.0 0.8
3 5.0 0.1
4 7.0 0.4
5 2.0 0.2
6 0.0 0.9
7 4.0 0.7
8 8.0 0.0
9 9.0 0.5
10 10.0 0.4
11 11.0 0.35
12 12.0 0.3
13 13.0 0.28
14 14.0 0.27
15 15.0 0.22"""), sep=r'\s+')
df = df.sort_values(by='a')
ax = df.plot(x='a',y='b',kind='scatter')
m, b = np.polyfit(df['a'],df['b'],1)
lin_reg = lambda x, m, b : m*x + b
df['lin'] = lin_reg(df['a'], m, b)
def make_m(x):
    # slope of the fit on the first len(x) rows
    y = df['b'].iloc[0:len(x)]
    return np.polyfit(x, y, 1)[0]
def make_b(x):
    # intercept of the fit on the first len(x) rows
    y = df['b'].iloc[0:len(x)]
    return np.polyfit(x, y, 1)[1]
df['new'] = df['a'].expanding().apply(make_m, raw=True)*df['a'] + df['a'].expanding().apply(make_b, raw=True)
# df = df.sort_values(by='a')
ax.plot(df.a,df.lin)
ax.plot(df.a,df.new)
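The expanding apply above refits the line on every prefix of the data. As a rough sketch only, the same per-row slope and intercept can also be written with cumulative sums and the closed-form least-squares formulas. The helper name expanding_linear_fit and the small demo frame are invented for illustration and are not part of the answer above; the first row is left as NaN because a slope is undefined for a single point.

```python
import numpy as np
import pandas as pd


def expanding_linear_fit(x, y):
    """Return (slope, intercept) Series of an OLS fit on rows 0..i for each i."""
    x = pd.Series(x, dtype=float).reset_index(drop=True)
    y = pd.Series(y, dtype=float).reset_index(drop=True)
    n = pd.Series(np.arange(1, len(x) + 1), dtype=float)
    # running sums needed by the closed-form simple regression formulas
    sx, sy = x.cumsum(), y.cumsum()
    sxx, sxy = (x * x).cumsum(), (x * y).cumsum()
    denom = n * sxx - sx ** 2
    denom = denom.where(denom != 0)  # NaN where the slope is undefined
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return slope, intercept


# hypothetical demo data, already sorted by 'a'
demo = pd.DataFrame({"a": [0.0, 1.0, 2.0, 3.0, 4.0],
                     "b": [0.9, 0.3, 0.2, 0.8, 0.1]})
slope, intercept = expanding_linear_fit(demo["a"], demo["b"])
demo["expanding_fit"] = (slope * demo["a"] + intercept).to_numpy()
print(demo)
```

The last row of the result should agree with a single np.polyfit over the whole column, which is a convenient sanity check.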

Related

How to apply a function between two pandas data frames

How can a custom function be applied to two data frames? The .apply method seems to iterate over rows or columns of a given dataframe, but I am not sure how to use this over two data frames at once. For example,
df1
m1 m2
x y x y z
0 0 10.0 12.0 16.0 17.0 9.0
0 10.0 13.0 15.0 12.0 4.0
1 0 11.0 14.0 14.0 11.0 5.0
1 3.0 14.0 12.0 10.0 9.0
df2
m1 m2
x y x y
0 0.5 0.1 1 0
In general, how can a function that maps df1 and df2 produce a new df3? Multiplication is one example, but I am looking for a generalized solution where I can just pass both frames to a function.
def custFunc(d1,d2):
    return (d1 * d2) - d2
df1.apply(lambda x: custFunc(x,df2[0]),axis=1)
#df2[0] meaning it is explicitly first row
and a df3 would be
m1 m2
x y x y z
0 0 5.5 1.3 16.0 0.0 9.0
0 5.5 1.4 15.0 0.0 4.0
1 0 6.0 1.5 14.0 0.0 5.0
1 2.0 1.5 12.0 0.0 9.0

You can pass the DataFrame and a Series straight to your function: select the row of df2 with DataFrame.loc, then use DataFrame.fillna to replace the missing values (columns that do not exist in df2) with the originals from df1:
def custFunc(d1,d2):
    return (d1 * d2) - d2
df = custFunc(df1, df2.loc[0]).fillna(df1)
print (df)
m1 m2
x y x y z
0 0 4.5 1.1 15.0 0.0 9.0
0 4.5 1.2 14.0 0.0 4.0
1 0 5.0 1.3 13.0 0.0 5.0
1 1.0 1.3 11.0 0.0 9.0
Detail:
print (df2.loc[0])
m1 x 0.5
y 0.1
m2 x 1.0
y 0.0
Name: 0, dtype: float64
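For anyone who wants to try this, here is a minimal self-contained sketch. The frames are rebuilt by hand from the first two rows of the question's df1 (the row index is simplified to a default RangeIndex, which is an assumption), and cust_func is just the question's function under a different name.

```python
import pandas as pd

# two-level column labels, as in the question
cols1 = pd.MultiIndex.from_tuples(
    [("m1", "x"), ("m1", "y"), ("m2", "x"), ("m2", "y"), ("m2", "z")])
df1 = pd.DataFrame([[10.0, 12.0, 16.0, 17.0, 9.0],
                    [10.0, 13.0, 15.0, 12.0, 4.0]], columns=cols1)

cols2 = pd.MultiIndex.from_tuples(
    [("m1", "x"), ("m1", "y"), ("m2", "x"), ("m2", "y")])
df2 = pd.DataFrame([[0.5, 0.1, 1.0, 0.0]], columns=cols2)


def cust_func(d1, d2):
    # the arithmetic aligns the Series index with the DataFrame columns
    return (d1 * d2) - d2


# ('m2', 'z') has no partner in df2, so it comes back as NaN
# and is then restored from df1 by fillna
result = cust_func(df1, df2.loc[0]).fillna(df1)
print(result)
```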

Organizing results of experiments with pandas

I have the following input data. Each line is the result of one experiment:
instance algo profit time
x A 10 0.5
y A 20 0.1
z A 13 0.7
x B 39 0.9
y B 12 1.2
z B 14 0.6
And I would like to generate the following table:
A B
instance profit time profit time
x 10 0.5 39 0.9
y 20 0.1 12 1.2
z 13 0.7 14 0.6
I have tried using pivot and pivot_table with no success. Is there any way to achieve this result with pandas?

First melt to get 'profit' and 'time' into the same column, and then use a pivot table with multiple column levels:
(df.melt(id_vars=['instance', 'algo'])
   .pivot_table(index='instance', columns=['algo', 'variable'], values='value'))
#algo A B
#variable profit time profit time
#instance
#x 10.0 0.5 39.0 0.9
#y 20.0 0.1 12.0 1.2
#z 13.0 0.7 14.0 0.6

set_index and unstack:
df.set_index(['instance', 'algo']).unstack()
profit time
algo A B A B
instance
x 10 39 0.5 0.9
y 20 12 0.1 1.2
z 13 14 0.7 0.6
(df.set_index(['instance', 'algo'])
   .unstack()
   .swaplevel(1, 0, axis=1)
   .sort_index(axis=1))
algo A B
profit time profit time
instance
x 10 0.5 39 0.9
y 20 0.1 12 1.2
z 13 0.7 14 0.6
Another option is using pivot and swaplevel:
(df.pivot(index='instance', columns='algo', values=['profit', 'time'])
   .swaplevel(1, 0, axis=1)
   .sort_index(axis=1))
algo A B
profit time profit time
instance
x 10.0 0.5 39.0 0.9
y 20.0 0.1 12.0 1.2
z 13.0 0.7 14.0 0.6
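A small runnable sketch that rebuilds the question's input and applies two of the approaches above side by side. The variable names wide and melted are only illustrative.

```python
import pandas as pd

# the experiment results from the question, one row per run
df = pd.DataFrame({
    "instance": ["x", "y", "z", "x", "y", "z"],
    "algo": ["A", "A", "A", "B", "B", "B"],
    "profit": [10, 20, 13, 39, 12, 14],
    "time": [0.5, 0.1, 0.7, 0.9, 1.2, 0.6],
})

# set_index + unstack, then put 'algo' on the top column level
wide = (df.set_index(["instance", "algo"])
          .unstack()
          .swaplevel(0, 1, axis=1)
          .sort_index(axis=1))

# melt + pivot_table gives the same layout (values become floats)
melted = (df.melt(id_vars=["instance", "algo"])
            .pivot_table(index="instance",
                         columns=["algo", "variable"],
                         values="value"))

print(wide)
print(melted)
```

One difference worth knowing: pivot_table aggregates (mean by default) if an instance/algo pair occurs more than once, whereas unstack raises on duplicate index entries.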

Prediction using random-forest model in python

I have this three column dataset formatted as in the following
t_stamp,Xval,Ytval
0.000543,0,10
0.000575,0,10
0.041324,1,10
0.041331,2,10
0.041336,3,10
0.04134,4,10
0.041345,5,10
0.04135,6,10
0.041354,7,10
How can we predict the current value of Y (the true value) from the last 5 data points of Xval using a random forest model from sklearn in Python? That is, taking [0,0,1,2,3] from the Xval column as input, I want to predict the 5th-row value of Ytval. With a simple rolling OLS regression model it can be done as follows, but I want to do it with a random forest model.
import pandas as pd
df = pd.read_csv('data_pred.csv')
# note: pd.stats.ols.MovingOLS only exists in old pandas versions (removed in 0.20)
model = pd.stats.ols.MovingOLS(y=df.Ytval, x=df[['Xval']],
                               window_type='rolling', window=5, intercept=True)

You can build the rolling input data yourself by reshaping your data so that each of the last 5 values of X becomes its own feature:
import pandas as pd
from io import StringIO
from sklearn.ensemble import RandomForestRegressor
data = StringIO("""t_stamp,Xval,Ytval
0.000543,0,10
0.000575,0,10
0.041324,1,10
0.041331,2,10
0.041336,3,10
0.04134,4,10
0.041345,5,10
0.04135,6,10
0.041354,7,10""")
df = pd.read_csv(data)
for i in range(1,6):
    df['Xval_t'+str(i)] = df['Xval'].shift(i)
Which yields df:
t_stamp Xval Ytval Xval_t1 Xval_t2 Xval_t3 Xval_t4 Xval_t5
0.000543 0 10 NaN NaN NaN NaN NaN
0.000575 0 10 0.0 NaN NaN NaN NaN
0.041324 1 10 0.0 0.0 NaN NaN NaN
0.041331 2 10 1.0 0.0 0.0 NaN NaN
0.041336 3 10 2.0 1.0 0.0 0.0 NaN
0.041340 4 10 3.0 2.0 1.0 0.0 0.0
0.041345 5 10 4.0 3.0 2.0 1.0 0.0
0.041350 6 10 5.0 4.0 3.0 2.0 1.0
0.041354 7 10 6.0 5.0 4.0 3.0 2.0
Of course, you need to decide on how to handle the NaNs. I just drop them for demonstration purposes.
df.dropna(inplace=True)
X = df[['Xval', 'Xval_t1', 'Xval_t2', 'Xval_t3', 'Xval_t4', 'Xval_t5']].values
y = df['Ytval'].values
reg = RandomForestRegressor()
reg.fit(X,y)
print(reg.predict(X))
Result:
[ 10. 10. 10. 10.]
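The shifting step can be wrapped in a small helper so the window length is a parameter. This is only a sketch: add_lag_features is a hypothetical name, and using n_lags=4 (the current value plus four previous ones, i.e. a window of five points) is one possible reading of "the last 5 data points"; the answer above uses the current value plus five lags.

```python
import pandas as pd


def add_lag_features(frame, column, n_lags):
    """Return a copy with column_t1..column_tN lag columns; incomplete rows dropped."""
    out = frame.copy()
    for lag in range(1, n_lags + 1):
        out[f"{column}_t{lag}"] = out[column].shift(lag)
    # the first n_lags rows have no full history, so they are removed
    return out.dropna().reset_index(drop=True)


demo = pd.DataFrame({"Xval": [0, 0, 1, 2, 3, 4, 5, 6, 7],
                     "Ytval": [10] * 9})
windowed = add_lag_features(demo, "Xval", n_lags=4)

feature_cols = ["Xval"] + [f"Xval_t{i}" for i in range(1, 5)]
X = windowed[feature_cols].to_numpy()
y = windowed["Ytval"].to_numpy()
print(X.shape, y.shape)
```

X and y can then be passed to RandomForestRegressor().fit(X, y) exactly as in the answer.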

Pandas join/merge/concat two DataFrames and combine rows of identical key/index [duplicate]

I am attempting to combine two sets of data, but I can't figure out which method is most suitable (join, merge, concat, etc.) for this application, and the documentation doesn't have any examples that do what I need to do.
I have two sets of data, structured like so:
>>> A
Time Voltage
1.0 5.1
2.0 5.5
3.0 5.3
4.0 5.4
5.0 5.0
>>> B
Time Current
-1.0 0.5
0.0 0.6
1.0 0.3
2.0 0.4
3.0 0.7
I would like to combine the data columns and merge the 'Time' column together so that I get the following:
>>> AB
Time Voltage Current
-1.0 0.5
0.0 0.6
1.0 5.1 0.3
2.0 5.5 0.4
3.0 5.3 0.7
4.0 5.4
5.0 5.0
I've tried AB = merge_ordered(A, B, on='Time', how='outer'), and while it successfully combined the data, it output something akin to:
>>> AB
Time Voltage Current
-1.0 0.5
0.0 0.6
1.0 5.1
1.0 0.3
2.0 5.5
2.0 0.4
3.0 5.3
3.0 0.7
4.0 5.4
5.0 5.0
You'll note that it did not combine rows with shared 'Time' values.
I have also tried merging a la AB = A.merge(B, on='Time', how='outer'), but that outputs something combined, but not sorted, like so:
>>> AB
Time Voltage Current
-1.0 0.5
0.0 0.6
1.0 5.1
2.0 5.5
3.0 5.3 0.7
4.0 5.4
5.0 5.0
1.0 0.3
2.0 0.4
...it essentially skips some of the data in 'Current' and appends it to the bottom, but it does so inconsistently. And again, it does not merge the rows together.
I have also tried AB = pandas.concat([A, B], axis=1), but the result does not get merged. I simply get, well, the concatenation of the two DataFrames, like so:
>>> AB
Time Voltage Time Current
1.0 5.1 -1.0 0.5
2.0 5.5 0.0 0.6
3.0 5.3 1.0 0.3
4.0 5.4 2.0 0.4
5.0 5.0 3.0 0.7
I've been scouring the documentation and here to try to figure out the exact differences between merge and join, but from what I gather they're pretty similar. Still, I haven't found anything that specifically answers the question of "how to merge rows that share an identical key/index". Can anyone enlighten me on how to do this? I only have a few days-worth of experience with Pandas!

merge
merge combines on columns. By default it uses all commonly named columns; otherwise, you can specify which columns to combine on. In this example, I chose Time.
A.merge(B, how='outer', on='Time')
Time Voltage Current
0 1.0 5.1 0.3
1 2.0 5.5 0.4
2 3.0 5.3 0.7
3 4.0 5.4 NaN
4 5.0 5.0 NaN
5 -1.0 NaN 0.5
6 0.0 NaN 0.6
join
join combines on index values unless you specify a column of the left-hand side instead. That is why I set the index of the right-hand side and specify the column Time for the left-hand side.
A.join(B.set_index('Time'), on='Time', how='outer')
Time Voltage Current
0 1.0 5.1 0.3
1 2.0 5.5 0.4
2 3.0 5.3 0.7
3 4.0 5.4 NaN
4 5.0 5.0 NaN
4 -1.0 NaN 0.5
4 0.0 NaN 0.6 
pd.concat
concat combines on index values... so I create a list comprehension in which I iterate over each dataframe I want to combine [A, B]. In the comprehension, each dataframe assumes the name d, hence the for d in [A, B]. axis=1 says to combine them side by side thus using the index as the joining feature.
pd.concat([d.set_index('Time') for d in [A, B]], axis=1).reset_index()
Time Voltage Current
0 -1.0 NaN 0.5
1 0.0 NaN 0.6
2 1.0 5.1 0.3
3 2.0 5.5 0.4
4 3.0 5.3 0.7
5 4.0 5.4 NaN
6 5.0 5.0 NaN
combine_first
A.set_index('Time').combine_first(B.set_index('Time')).reset_index()
Time Current Voltage
0 -1.0 0.5 NaN
1 0.0 0.6 NaN
2 1.0 0.3 5.1
3 2.0 0.4 5.5
4 3.0 0.7 5.3
5 4.0 NaN 5.4
6 5.0 NaN 5.0
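To see which side each row of the result came from, merge accepts indicator=True, which adds a _merge column. The sketch below rebuilds A and B from the question and sorts the outer merge by Time; the explicit sort and reset_index are additions for readability, not something the methods above require.

```python
import pandas as pd

A = pd.DataFrame({"Time": [1.0, 2.0, 3.0, 4.0, 5.0],
                  "Voltage": [5.1, 5.5, 5.3, 5.4, 5.0]})
B = pd.DataFrame({"Time": [-1.0, 0.0, 1.0, 2.0, 3.0],
                  "Current": [0.5, 0.6, 0.3, 0.4, 0.7]})

# outer merge keeps every Time value; _merge records where each row came from
AB = (A.merge(B, on="Time", how="outer", indicator=True)
        .sort_values("Time")
        .reset_index(drop=True))
print(AB)
```

Rows marked both are the ones where the keys matched exactly; if rows you expected to match show up as left_only and right_only, the key values differ in some way (dtype or tiny floating-point differences).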

It should work properly if the Time column is of the same dtype in both DFs:
In [192]: A.merge(B, how='outer').sort_values('Time')
Out[192]:
Time Voltage Current
5 -1.0 NaN 0.5
6 0.0 NaN 0.6
0 1.0 5.1 0.3
1 2.0 5.5 0.4
2 3.0 5.3 0.7
3 4.0 5.4 NaN
4 5.0 5.0 NaN
In [193]: A.dtypes
Out[193]:
Time float64
Voltage float64
dtype: object
In [194]: B.dtypes
Out[194]:
Time float64
Current float64
dtype: object
Reproducing your problem:
In [198]: A.merge(B.assign(Time=B.Time.astype(str)), how='outer').sort_values('Time')
Out[198]:
Time Voltage Current
5 -1.0 NaN 0.5
6 0.0 NaN 0.6
0 1.0 5.1 NaN
7 1.0 NaN 0.3
1 2.0 5.5 NaN
8 2.0 NaN 0.4
2 3.0 5.3 NaN
9 3.0 NaN 0.7
3 4.0 5.4 NaN
4 5.0 5.0 NaN
In [199]: B.assign(Time=B.Time.astype(str)).dtypes
Out[199]:
Time object # <------ NOTE
Current float64
dtype: object
Visually it's hard to distinguish:
In [200]: B.assign(Time=B.Time.astype(str))
Out[200]:
Time Current
0 -1.0 0.5
1 0.0 0.6
2 1.0 0.3
3 2.0 0.4
4 3.0 0.7
In [201]: B
Out[201]:
Time Current
0 -1.0 0.5
1 0.0 0.6
2 1.0 0.3
3 2.0 0.4
4 3.0 0.7

Solution found
As per the suggestions below, I had to round the numbers in the 'Time' column prior to merging them, despite the fact that they were both of the same dtype (float64). The suggestion was to round like so:
A = A.assign(Time=A.Time.round(4))
But in my actual situation, the column was labeled 'Time, (sec)' (there was punctuation that screwed with the assignment). So instead I used the following line to round it:
A['Time, (sec)'] = A['Time, (sec)'].round(4)
And it worked like a charm. Are there any issues with doing it like that?
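A tiny sketch of why rounding helps, using made-up one-row frames: two float64 keys that print almost identically can still differ in the last bits, and merge matches keys on exact equality. The last lines show one way to use assign with a column name that is not a valid Python identifier, by unpacking a dict.

```python
import pandas as pd

# 0.1 + 0.2 is 0.30000000000000004, not exactly 0.3
A = pd.DataFrame({"Time": [0.1 + 0.2], "Voltage": [5.1]})
B = pd.DataFrame({"Time": [0.3], "Current": [0.3]})

# keys differ in the last bits, so the rows do not line up
raw = A.merge(B, on="Time", how="outer")

# rounding both key columns first makes them compare equal
fixed = (A.assign(Time=A["Time"].round(4))
           .merge(B.assign(Time=B["Time"].round(4)), on="Time", how="outer"))

# assign with a label containing punctuation: pass it via ** unpacking
renamed = A.rename(columns={"Time": "Time, (sec)"})
rounded = renamed.assign(**{"Time, (sec)": renamed["Time, (sec)"].round(4)})

print(len(raw), len(fixed))
```

Plain item assignment, as in the line above, does the same thing in place; assign just returns a new frame instead.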

Combine two Pandas dataframes, resample on one time column, interpolate

This is my first question on stackoverflow. Go easy on me!
I have two data sets acquired simultaneously by different acquisition systems with different sampling rates. One is very regular, and the other is not. I would like to create a single dataframe containing both data sets, using the regularly spaced timestamps (in seconds) as the reference for both. The irregularly sampled data should be interpolated on the regularly spaced timestamps.
Here's some toy data demonstrating what I'm trying to do:
import pandas as pd
import numpy as np
# evenly spaced times
t1 = np.array([0,0.5,1.0,1.5,2.0])
y1 = t1
# unevenly spaced times
t2 = np.array([0,0.34,1.01,1.4,1.6,1.7,2.01])
y2 = 3*t2
df1 = pd.DataFrame(data={'y1':y1,'t':t1})
df2 = pd.DataFrame(data={'y2':y2,'t':t2})
df1 and df2 look like this:
df1:
t y1
0 0.0 0.0
1 0.5 0.5
2 1.0 1.0
3 1.5 1.5
4 2.0 2.0
df2:
t y2
0 0.00 0.00
1 0.34 1.02
2 1.01 3.03
3 1.40 4.20
4 1.60 4.80
5 1.70 5.10
6 2.01 6.03
I'm trying to merge df1 and df2, interpolating y2 on df1.t. The desired result is:
df_combined:
t y1 y2
0 0.0 0.0 0.0
1 0.5 0.5 1.5
2 1.0 1.0 3.0
3 1.5 1.5 4.5
4 2.0 2.0 6.0
I've been reading documentation for pandas.resample, as well as searching previous stackoverflow questions, but haven't been able to find a solution to my particular problem. Any ideas? Seems like it should be easy.
UPDATE:
I figured out one possible solution: interpolate the second series first, then append to the first data frame:
from scipy.interpolate import interp1d
f2 = interp1d(t2,y2,bounds_error=False)
df1['y2'] = f2(df1.t)
which gives:
df1:
t y1 y2
0 0.0 0.0 0.0
1 0.5 0.5 1.5
2 1.0 1.0 3.0
3 1.5 1.5 4.5
4 2.0 2.0 6.0
That works, but I'm still open to other solutions if there's a better way.

If you construct a single DataFrame from Series, using time values as index, like this:
>>> t1 = np.array([0, 0.5, 1.0, 1.5, 2.0])
>>> y1 = pd.Series(t1, index=t1)
>>> t2 = np.array([0, 0.34, 1.01, 1.4, 1.6, 1.7, 2.01])
>>> y2 = pd.Series(3*t2, index=t2)
>>> df = pd.DataFrame({'y1': y1, 'y2': y2})
>>> df
y1 y2
0.00 0.0 0.00
0.34 NaN 1.02
0.50 0.5 NaN
1.00 1.0 NaN
1.01 NaN 3.03
1.40 NaN 4.20
1.50 1.5 NaN
1.60 NaN 4.80
1.70 NaN 5.10
2.00 2.0 NaN
2.01 NaN 6.03
You can simply interpolate it, and select only the part where y1 is defined:
>>> df.interpolate(method='index').reindex(y1.index)
y1 y2
0.0 0.0 0.0
0.5 0.5 1.5
1.0 1.0 3.0
1.5 1.5 4.5
2.0 2.0 6.0
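For plain linear interpolation, NumPy's np.interp is another option that avoids building the combined index. This is an alternative sketch rather than the method shown above; note that np.interp clamps to the end values outside the range of t2 instead of returning NaN.

```python
import numpy as np
import pandas as pd

# toy data from the question
t1 = np.array([0, 0.5, 1.0, 1.5, 2.0])                # evenly spaced times
t2 = np.array([0, 0.34, 1.01, 1.4, 1.6, 1.7, 2.01])   # unevenly spaced times
df1 = pd.DataFrame({"t": t1, "y1": t1})
df2 = pd.DataFrame({"t": t2, "y2": 3 * t2})

# evaluate y2 at the regular timestamps of df1 (df2.t must be increasing)
df_combined = df1.assign(y2=np.interp(df1["t"], df2["t"], df2["y2"]))
print(df_combined)
```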

It's not exactly clear to me how you're getting rid of some of the values in y2, but it seems like if there is more than one for a given timepoint, you only want the first one. Also, it seems like your time values should be in the index. I also added column labels. It looks like this:
import pandas as pd
# evenly spaced times
t1 = [0,0.5,1.0,1.5,2.0]
y1 = t1
# unevenly spaced times
t2 = [0,0.34,1.01,1.4,1.6,1.7,2.01]
# round t2 values to the nearest half
new_t2 = [round(num * 2)/2 for num in t2]
# set y2 values
y2 = [3*z for z in new_t2]
# eliminate entries that have the same index value
# (iterate backwards so deletions don't shift the positions still to be checked)
for x in range(len(new_t2) - 1, 0, -1):
    if new_t2[x] == new_t2[x-1]:
        del new_t2[x]
        del y2[x]
ser1 = pd.Series(y1, index=t1)
ser2 = pd.Series(y2, index=new_t2)
df = pd.concat((ser1, ser2), axis=1)
df.columns = ('Y1', 'Y2')
print(df)
This prints:
Y1 Y2
0.0 0.0 0.0
0.5 0.5 1.5
1.0 1.0 3.0
1.5 1.5 4.5
2.0 2.0 6.0
