I am trying to aggregate a pandas DataFrame and create two new columns holding the slope and intercept from a simple linear regression fit.
The dummy dataset looks like this:
CustomerID Month Value
a 1 10
a 2 20
a 3 20
b 1 30
b 2 40
c 1 80
c 2 90
And I want the output to look like this - which would regress Value against Month for each CustomerID:
CustomerID Slope Intercept
a 0.30 10
b 0.20 30
c 0.12 80
I know I could run a loop and fit a linear regression for each CustomerID, but my dataset is huge and I need a vectorized approach. I tried groupby and apply, passing a linear regression function, but didn't find a solution that worked.
Thanks in advance!
You can use scipy with groupby. Here I iterate over the groups directly rather than using apply, since apply is slower than a plain loop:
import pandas as pd
from scipy import stats
pd.DataFrame.from_dict(
    {cid: stats.linregress(g['Month'], g['Value'])[:2]
     for cid, g in df.groupby('CustomerID')},
    orient='index').rename(columns={0: 'Slope', 1: 'Intercept'})
Out[798]:
Slope Intercept
a 5.0 6.666667
b 10.0 20.000000
c 10.0 70.000000
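Since stats.linregress returns a result with named fields, here is an equivalent sketch that avoids the positional [:2] slice (reusing df from the question):
res = {cid: stats.linregress(g['Month'], g['Value'])
       for cid, g in df.groupby('CustomerID')}
pd.DataFrame({cid: {'Slope': r.slope, 'Intercept': r.intercept}
              for cid, r in res.items()}).T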
I would like to first create two columns and then use log() to calculate the periodic daily returns for the Price and Adjusted Close columns, and then use those periodic returns to find the correlation between them.
I tried
Combine_data['log_return'] = np.log(1 + Combine_data.pct_change)
Combine_data.head()
but it is not working.
Combine_data = pd.merge(XAU_USD, SP500, on='Date',
                        suffixes=('(GOLD)', '(SP500)'))
Combine_data.set_index('Date',inplace=True)
Combine_data.head()
You can try that as follows:
import numpy as np
ser1 = (df['gold'] + 1).apply(np.log)
ser2 = (df['silver'] + 1).apply(np.log)
np.corrcoef(ser1, ser2)
The result looks like:
Out[431]:
array([[1. , 0.30121126],
[0.30121126, 1. ]])
A correlation of 0.301 is not bad, given that the data is randomly generated :-)
For reference, the randomly generated test data:
Out[430]:
gold silver date
0 793.559641 19.112793 2019-08-23
1 1428.329390 17.758924 2019-08-24
2 1044.061092 17.962435 2019-08-25
3 1222.397539 17.638691 2019-08-26
4 890.945841 11.593497 2019-08-27
5 1224.616916 15.759736 2019-08-28
6 1059.684075 12.900665 2019-08-29
7 1147.011421 20.274250 2019-08-30
8 929.638993 12.244630 2019-08-31
9 515.545695 14.609073 2019-09-01
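As an aside, the log-return attempt in the question fails only because pct_change is a method and is missing its parentheses. A corrected sketch for a single price column (the column name here is a hypothetical stand-in for one of the merged columns):
price_col = 'Close(GOLD)'  # hypothetical name produced by the merge suffixes
Combine_data['log_return'] = np.log(1 + Combine_data[price_col].pct_change())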
Here are two methods to do this:
Use scipy's pearsonr:
from scipy.stats import pearsonr
coeff = pearsonr(x, y)[0]  # index 0 is the coefficient, index 1 is the p-value
Use numpy's corrcoef:
import numpy as np
coeff = np.corrcoef(x, y)[0, 1]
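A quick sanity check with hypothetical arrays; both calls print the same coefficient:
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.9])
print(pearsonr(x, y)[0])        # Pearson r from scipy
print(np.corrcoef(x, y)[0, 1])  # the same value from numpy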
I have the following toy df:
   FilterSystemO2Concentration (Percentage)  ProcessChamberHumidityAbsolute (g/m3)  ProcessChamberPressure (mbar)  ...
0  0.156  1.0  29.5    28.4
1  ...    ...  29.6    28.4
2  0.149  1.3  29.567  28.9
3  0.149  1.0  29.567  28.9
4  0.148  1.6  29.6    29.4
This is just a sample; the original has over 1200 rows. What's the best way to oversample it while preserving its statistical properties?
I have googled for some time and have only come across resampling algorithms for imbalanced classes, but that's not what I want. I'm not interested in balancing the data; I just want to produce more samples in a way that more or less preserves the original distributions and statistical properties.
Thanks in advance
Using scipy.stats.rv_histogram(np.histogram(data)).isf(np.random.random(size=n)) will create n new samples randomly chosen from the distribution (histogram) of the data. You can do this for each column:
Example:
import numpy as np
import pandas as pd
import scipy.stats as stats
df = pd.DataFrame({'x': np.random.random(100) * 3,
                   'y': np.random.random(100) * 4 - 2})
n = 5
new_values = pd.DataFrame(
    {s: stats.rv_histogram(np.histogram(df[s])).isf(np.random.random(size=n))
     for s in df.columns})
df = pd.concat([df.assign(data_type='original'),
                new_values.assign(data_type='oversampled')])
df.tail(7)
>> x y data_type
98 1.176073 -0.207858 original
99 0.734781 -0.223110 original
0 2.014739 -0.369475 oversampled
1 2.825933 -1.122614 oversampled
2 0.155204 1.421869 oversampled
3 1.072144 -1.834163 oversampled
4 1.251650 1.353681 oversampled
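Since rv_histogram is a full scipy distribution object, its rvs method is a slightly more direct way to draw the same kind of samples; a sketch reusing df and n from above:
new_values = pd.DataFrame(
    {s: stats.rv_histogram(np.histogram(df[s])).rvs(size=n)
     for s in df.columns})
Keep in mind that sampling each column independently preserves the marginal distributions but not any correlation between columns.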
I have:
A1 A2 Random data Random data2 Average Stddev
0 0.1 2.0 300 3000 1.05 1.343503
1 0.5 4.5 4500 450 2.50 2.828427
2 3.0 1.2 800 80 2.10 1.272792
3 9.0 9.0 900 90 9.00 0.000000
And I would like to add a column 'ColumnX' whose values are calculated as:
ColumnX = min(df['Random data'] - df['Average'],
              (df['Random data2'] - df['Stddev']) / (3.0 * df['A2']))
I get the error:
ValueError: The truth value of a Series is ambiguous.
Your error comes from calling the built-in min on two Series: comparing them element-wise produces a boolean Series, which pandas cannot reduce to a single True/False, and the built-in min won't work row-wise anyway.
A potential solution is to make two new calculated columns and then use the pandas DataFrame .min method:
df['calc_col_1'] = df['Random data'] - df['Average']
df['calc_col_2'] = (df['Random data2'] - df['Stddev']) / (3.0 * df['A2'])
df['min_col'] = df[['calc_col_1', 'calc_col_2']].min(axis=1)
The min(axis=1) call finds the row-wise minimum of the two columns, which is then assigned to the new column. This approach is efficient because it uses numpy vectorization, and it is easier to read.
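Alternatively, numpy's element-wise np.minimum computes the same row-wise minimum without the intermediate columns; a sketch on the same frame:
import numpy as np
df['ColumnX'] = np.minimum(df['Random data'] - df['Average'],
                           (df['Random data2'] - df['Stddev']) / (3.0 * df['A2']))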
I have some data in a pandas dataframe which has a triple multi-index:
Antibody Time Repeats
Customer_Col1A2 0 1 0.657532
2 0.639933
3 0.975302
5 1 0.628196
2 0.663301
3 0.921025
10 1 0.665601
2 0.785324
3 0.697913
My question is: what is the best way to calculate the average and (sample) standard deviation for this data, grouped by time point? So the answer for the 0 time point would be (0.657532 + 0.639933 + 0.975302) / 3 = 0.757589 for the average and 0.188750216 for the sample SD. The output would look something like this:
Antibody Time Average sample SD
Customer_Col1A2 0 0.757589 0.188750216
5 .... ....
10 .... ....
Thanks in advance
You can group by the levels of the multi-index by specifying the level parameter, and calculate the average and SD with the 'mean' and 'std' aggregations respectively:
df1.groupby(level=[0, 1]).agg(['mean', 'std'])
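If you want the exact labels from the desired output, named aggregation works too; a sketch assuming the values sit in a column called 'value' (a hypothetical name):
df1.groupby(level=[0, 1])['value'].agg(Average='mean', SD='std')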
I'm trying to find a way to iterate a linear regression over many, many columns, all the way up to Z3. Here is a snippet of the dataframe, called df1:
Time A1 A2 A3 B1 B2 B3
1 1.00 6.64 6.82 6.79 6.70 6.95 7.02
2 2.00 6.70 6.86 6.92 NaN NaN NaN
3 3.00 NaN NaN NaN 7.07 7.27 7.40
4 4.00 7.15 7.26 7.26 7.19 NaN NaN
5 5.00 NaN NaN NaN NaN 7.40 7.51
6 5.50 7.44 7.63 7.58 7.54 NaN NaN
7 6.00 7.62 7.86 7.71 NaN NaN NaN
This code returns the slope coefficient of a linear regression for just ONE column and concatenates the value onto a numpy array called series. Here is what it looks like for extracting the slope of the first column:
import numpy as np
from sklearn.linear_model import LinearRegression

series = np.array([])  # blank array to append results to
df2 = df1[~np.isnan(df1['A1'])]  # remove NaN values so the sklearn fit works
df3 = df2[['Time', 'A1']]
npMatrix = np.matrix(df3)
X, Y = npMatrix[:, 0], npMatrix[:, 1]
slope = LinearRegression().fit(X, Y)
m = slope.coef_[0]
series = np.concatenate((series, m), axis=0)
As it stands, I am copying this slice of code and replacing "A1" with each new column name all the way up to "Z3", which is extremely inefficient. I know there are many easy ways to do this with some modules, but all the intermediate NaN values in the timeseries seem to limit me to this method, or something like it.
I tried using a for loop such as:
for col in df1.columns:
and replacing 'A1' with col in the code, but this does not seem to work.
Is there any way I can do this more efficiently?
Thank you!
One liner (or three)
import numpy as np
import pandas as pd

time = df[['Time']]
pd.DataFrame(np.linalg.pinv(time.T.dot(time)).dot(time.T).dot(df.fillna(0)),
             ['Slope'], df.columns)
Broken down with a bit of explanation
Using the closed form of OLS: beta = (XᵀX)⁻¹ Xᵀ Y
In this case X is time where we define time as df[['Time']]. I used the double brackets to preserve the dataframe and its two dimensions. If I'd done single brackets, I'd have gotten a series and its one dimension. Then the dot products aren't as pretty.
(XᵀX)⁻¹ Xᵀ is np.linalg.pinv(time.T.dot(time)).dot(time.T)
Y is df.fillna(0). Yes, we could have done one column at a time, but why, when we can do it all at once? You do have to deal with the NaNs somehow. How would you imagine dealing with them? Only fitting over the times where you had data? That is equivalent to placing zeroes in the NaN spots, so that is what I did.
Finally, I use pd.DataFrame(stuff, ['Slope'], df.columns) to contain all slopes in one place with the original labels.
Note that I calculated the slope of the regression for Time against itself. Why not? It was there. Its value is 1.0. Great! I probably did it right!
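If you also need intercepts, the same closed form works once a constant column is stacked onto X; a sketch under the same fillna(0) assumption:
X = np.column_stack([np.ones(len(df)), df['Time']])
coeffs = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(df.fillna(0))
pd.DataFrame(coeffs, ['Intercept', 'Slope'], df.columns)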
Looping is a decent strategy for a modest number (say, fewer than thousands) of columns. Without seeing your implementation, I can't say what's wrong, but here's my version, which works:
import numpy as np
from sklearn.linear_model import LinearRegression

slopes = []
for c in df1.columns:
    if c == "Time":
        continue  # skip the regressor column itself
    mask = ~np.isnan(df1[c])  # fit only where this column has data
    x = np.atleast_2d(df1.Time[mask].values).T
    y = np.atleast_2d(df1[c][mask].values).T
    reg = LinearRegression().fit(x, y)
    slopes.append(reg.coef_[0])
I've simplified your code a bit to avoid creating so many temporary DataFrame objects, but it should work fine your way too.
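If you want the slopes labelled by column like the one-liner above, a small follow-up sketch that flattens the collected coefficients into a Series:
import pandas as pd
cols = [c for c in df1.columns if c != "Time"]
pd.Series(np.concatenate(slopes), index=cols, name='Slope')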