Pandas apply polyfit to a series against a value of the series - python

I'm new to the Pandas world and it has been hard to stop thinking sequentially.
I have a Series like:
df['sensor'].head(30)
0 6.8855
1 6.8855
2 6.8875
3 6.8885
4 6.8885
5 6.8895
6 6.8895
7 6.8895
8 6.8905
9 6.8905
10 6.8915
11 6.8925
12 6.8925
13 6.8925
14 6.8925
15 6.8925
16 6.8925
17 6.8925
Name: int_price, dtype: float64
I want to calculate the polynomial fit of the first value against all the others and then find an average. I defined a function to do the calculation, and I want it to be applied to the series.
The function:
import numpy as np

def linear_trend(a, b):
    return np.polyfit([1, 2], [a, b], 1)
The application:
a = pd.Series(df_plot['sensor'].iloc[0] for x in range(len(df_plot.index)))
df['ref'] = df_plot['sensor'].apply(lambda df_plot: linear_trend(a,df['sensor']))
This returns TypeError: No loop matching the specified signature and casting was found for ufunc lstsq_m.
or this:
a = df_plot['sensor'].iloc[0]
df['ref'] = df_plot['sensor'].apply(lambda df_plot: linear_trend(a,df['sensor']))
That returns ValueError: setting an array element with a sequence.
How can I solve this?

I was able to work around my issue by doing the following:
a = pd.Series(data=(df_plot['sensor'].iloc[0] for x in range(len(df_plot.index))), name='sensor_ref')
df_poly = pd.concat([a,df_plot['sensor']],axis=1)
df_plot['slope'] = df_poly[['sensor_ref','sensor']].apply(lambda df_poly: linear_trend(df_poly['sensor_ref'],df_poly['sensor']), axis=1)
If you have a better method, it's welcome.
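A possibly simpler sketch: np.polyfit([1, 2], [a, b], 1) over two points spaced one unit apart always returns slope b - a (and intercept 2*a - b), so the per-row fit can be replaced by plain vectorized arithmetic, assuming only the slope is needed:
base = df_plot['sensor'].iloc[0]
# slope of the degree-1 fit through (1, base) and (2, y) is simply y - base
df_plot['slope'] = df_plot['sensor'] - base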


Apply() function based on column condition restarting instead of changing

I have the following apply function:
week_to_be_predicted = 15
df['raw_data'] = df.apply(lambda x: get_raw_data(x['boxscore']) if x['week_int']<week_to_be_predicted else 0,axis=1)
df['raw_data'] = df.apply(lambda x: get_initials(x['boxscore']) if x['week_int']==week_to_be_predicted else x['raw_data'],axis=1)
Where df['week_int'] is a column of integers starting from 0 and increasing to 18. If the row value of df['week_int'] < week_to_be_predicted (in this case 15), I want get_raw_data to be applied; otherwise, I want get_initials to be applied.
My question is about troubleshooting the apply() function. After successfully applying get_raw_data to all rows where week_int < 14, instead of putting 0s in the remaining rows of df['raw_data'] (the else 0 branch), the "loop" restarts: it begins again from the first row of the dataframe and applies get_raw_data all over again, seemingly stuck in an infinite loop.
What's more confounding is that it does not always do this. The functions as written initially solved this same problem and have been working as intended for the past ~10 weeks, but now, all of a sudden, when I set week_to_be_predicted to 15, it reverts to its old ways.
I'm wondering if this has something to do with the apply() function, the conditions inside it, or both. It's difficult for me to troubleshoot, as the logic has worked in the past. I'm wondering if there is something about apply() that makes it a less-than-optimal approach here, and if anybody knows what aspect might be causing the problem.
Thank you in advance.
Use a boolean mask:
import numpy as np
import pandas as pd

week_to_be_predicted = 15

def get_raw_data(sr):    # stand-in for the real function
    return -sr

def get_initials(sr):    # stand-in for the real function
    return sr

df = pd.DataFrame({'week_int': np.arange(0, 19),
                   'boxscore': np.random.random(19)})

m = df['week_int'] < week_to_be_predicted
df.loc[m, 'raw_data'] = get_raw_data(df.loc[m, 'boxscore'])
df.loc[~m, 'raw_data'] = get_initials(df.loc[~m, 'boxscore'])
Output:
>>> df
week_int boxscore raw_data
0 0 0.232081 -0.232081
1 1 0.890318 -0.890318
2 2 0.372760 -0.372760
3 3 0.697202 -0.697202
4 4 0.400200 -0.400200
5 5 0.793784 -0.793784
6 6 0.783359 -0.783359
7 7 0.898331 -0.898331
8 8 0.440433 -0.440433
9 9 0.415760 -0.415760
10 10 0.599502 -0.599502
11 11 0.941613 -0.941613
12 12 0.039865 -0.039865
13 13 0.820617 -0.820617
14 14 0.471396 -0.471396
15 15 0.794547 0.794547
16 16 0.682332 0.682332
17 17 0.638694 0.638694
18 18 0.761995 0.761995
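If get_raw_data and get_initials are not vectorizable (for example, if they fetch data per boxscore string), a sketch of the same idea that keeps the mask but restricts apply to each subset, assuming the function signatures from the question:
m = df['week_int'] < week_to_be_predicted
df.loc[m, 'raw_data'] = df.loc[m, 'boxscore'].apply(get_raw_data)
df.loc[~m, 'raw_data'] = df.loc[~m, 'boxscore'].apply(get_initials)
Each function is only ever called on the rows that satisfy its condition, so neither branch can "restart" over the full dataframe.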

how to construct an index from percentage change time series?

Consider the values below:
array1 = np.array([526.59, 528.88, 536.19, 536.18, 536.18, 534.14, 538.14, 535.44,532.21, 531.94, 531.89, 531.89, 531.23, 529.41, 526.31, 523.67])
I convert these into a pandas Series object
import numpy as np
import pandas as pd
df = pd.Series(array1)
And compute the percentage change as
df = (1+df.pct_change(periods=1))
From here, how do I construct an index (base=100)? My desired output is:
0 100.00
1 100.43
2 101.82
3 101.82
4 101.82
5 101.43
6 102.19
7 101.68
8 101.07
9 101.02
10 101.01
11 101.01
12 100.88
13 100.54
14 99.95
15 99.45
I can achieve the objective with an iterative (loop) solution, but that may not be practical if the data is deep and wide. Secondly, is there a way to do this in a single step on multiple columns? Thank you all for any guidance.
An index (base=100) is the relative change of a series in relation to its first element, so there's no need to take a detour through percentage changes and recalculate the index from them when you can get it directly with:
df = pd.Series(array1)/array1[0]*100
As far as I know, there is still no off-the-shelf expanding_window version for pct_change(). You can avoid the for-loop by using apply:
# generate data
import pandas as pd
series = pd.Series([526.59, 528.88, 536.19, 536.18, 536.18, 534.14, 538.14, 535.44,532.21, 531.94, 531.89, 531.89, 531.23, 529.41, 526.31, 523.67])
# compute percentage change with respect to the first value
series.apply(lambda x: ((x / series.iloc[0]) - 1) * 100) + 100
Output:
0 100.000000
1 100.434873
2 101.823050
3 101.821151
4 101.821151
5 101.433753
6 102.193357
7 101.680624
8 101.067244
9 101.015971
10 101.006476
11 101.006476
12 100.881141
13 100.535521
14 99.946828
15 99.445489
dtype: float64
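Both approaches also extend to several columns at once, because dividing by the first row broadcasts column-wise. A sketch, assuming a DataFrame whose columns are price-like series (the second column here is just the first one reversed, purely for illustration):
df2 = pd.DataFrame({'a': array1, 'b': array1[::-1]})
indexed = df2.div(df2.iloc[0]) * 100   # base=100 index for every column in one step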

how to create new column based on multiple columns with a function

This question is a follow-up to my question about linear interpolation between two data points.
I built the following function from it:
import numpy as np
import pandas as pd

def inter(colA, colB):
    s = pd.Series([colA, np.nan, colB], index=[95, 100, 102.5])
    s = s.interpolate(method='index')
    return s.iloc[1]
Now I have a data frame that looks like this:
on95 on102.5 on105
Index
1 5 17 20
2 7 15 25
3 6 16 23
I would like to create a new column df['new'] that uses the function inter with inputs of on95 and on102.5
I tried like this:
df['new'] = inter(df['on95'],df['on102.5'])
but this resulted in NaNs.
I also tried apply(inter) but did not find a way to make it work without an error message.
Any hints on how I can solve this?
You need to vectorize your self-defined function with np.vectorize, since otherwise the function receives whole pandas Series as its parameters:
inter = np.vectorize(inter)
df['new'] = inter(df['on95'],df['on102.5'])
df
on95 on102.5 on105 new
#Index
# 1 5 17 20 13.000000
# 2 7 15 25 12.333333
# 3 6 16 23 12.666667
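Since inter is just a linear interpolation between the points (95, colA) and (102.5, colB) evaluated at x=100, the same column can also be built with plain vectorized arithmetic, without apply or np.vectorize; a sketch under that assumption:
# fraction of the way from x=95 to x=102.5 at which x=100 sits
frac = (100 - 95) / (102.5 - 95)
df['new'] = df['on95'] + (df['on102.5'] - df['on95']) * frac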

Best data structure to use in python to store a 3 dimensional cube of named data

I would like some feedback on my choice of data structure. I have a 2D X-Y grid of current values for a specific voltage value. I have several voltage steps and have organized the data into a cube of X-Y-Voltage. I illustrated the axes here: http://imgur.com/FVbluwB.
I currently use numpy arrays inside Python dictionaries for the different kinds of transistors I am sweeping. I'm not sure if this is the best way to do this. I've looked at Pandas, but I'm also not sure if this is a good job for Pandas. I was hoping someone could help me out so I can learn to be Pythonic! The code to generate some test data and the resulting structure is below.
Thank you!
import numpy as np
#make test data
test__transistor_data0 = {"SNMOS":np.random.randn(3,256,256), "SPMOS":np.random.randn(4,256,256), "WPMOS":np.random.randn(6,256,256), "WNMOS":np.random.randn(6,256,256)}
test__transistor_data1 = {"SNMOS":np.random.randn(3,256,256), "SPMOS":np.random.randn(4,256,256), "WPMOS":np.random.randn(6,256,256), "WNMOS":np.random.randn(6,256,256)}
test__transistor_data2 = {"SNMOS":np.random.randn(3,256,256), "SPMOS":np.random.randn(4,256,256), "WPMOS":np.random.randn(6,256,256), "WNMOS":np.random.randn(6,256,256)}
test__transistor_data3 = {"SNMOS":np.random.randn(3,256,256), "SPMOS":np.random.randn(4,256,256), "WPMOS":np.random.randn(6,256,256), "WNMOS":np.random.randn(6,256,256)}
quadrant_data = {"ne":test__transistor_data0, "nw":test__transistor_data1, "sw":test__transistor_data2, "se":test__transistor_data3}
It may be worth checking out xarray, which is like (and partially based on) pandas, but designed for N-dimensional data.
Its two fundamental containers are a DataArray, which is a labeled N-dimensional array, and a Dataset, which is a container of DataArrays.
In [29]: s1 = xray.DataArray(np.random.randn(3,256,256), dims=['voltage', 'x', 'y'])
In [30]: s2 = xray.DataArray(np.random.randn(3,256,256), dims=['voltage', 'x', 'y'])
In [32]: ds = xray.Dataset({'SNMOS': s1, 'SPMOS': s2})
In [33]: ds
Out[33]:
<xray.Dataset>
Dimensions: (voltage: 3, x: 256, y: 256)
Coordinates:
* voltage (voltage) int64 0 1 2
* x (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
* y (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
Data variables:
SPMOS (voltage, x, y) float64 -1.363 2.446 0.3585 -0.8243 -0.814 ...
SNMOS (voltage, x, y) float64 1.07 2.327 -1.435 0.4011 0.2379 2.07 ...
Both containers have a lot of nice functionality (see the docs). For example, if you wanted to know the maximum value for each transistor at the first voltage level, it'd be something like this:
In [39]: ds.sel(voltage=0).max(dim='x').max()
Out[39]:
<xray.Dataset>
Dimensions: ()
Coordinates:
*empty*
Data variables:
SPMOS float64 4.175
SNMOS float64 4.302
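As a quick usage sketch (using the same dimension names assumed above), pulling one transistor's X-Y current grid back out at a given voltage step is a one-liner:
snmos_plane = ds['SNMOS'].sel(voltage=1)   # 256 x 256 DataArray at voltage step 1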

scale numerical values for different groups in python

I want to scale the numerical values (similar to R's scale function) based on different groups.
Note: by scaling I am referring to this metric:
(x - group_mean) / group_std
Example dataset (to demonstrate the idea):
advertiser_id value
10 11
10 22
10 2424
11 34
11 342342
.....
Desired results:
advertiser_id scaled_value
10 -0.58
10 -0.57
10 1.15
11 -0.707
11 0.707
.....
Referring to this link: implementing R scale function in pandas in Python?, I used the scale function defined there and want to apply it in this fashion:
dt.groupby("advertiser_id").apply(scale)
but I get an error:
ValueError: Shape of passed values is (2, 15770), indices imply (2, 23375)
In my original dataset the number of rows is 15770, but I don't think that in my case the scale function maps a single value to more than 2 results.
I would appreciate some sample code or suggestions on how to modify this. Thanks!
First, np.std behaves differently from most other languages in that its delta degrees of freedom (ddof) defaults to 0. Therefore:
In [9]:
print df
advertiser_id value
0 10 11
1 10 22
2 10 2424
3 11 34
4 11 342342
In [10]:
print df.groupby('advertiser_id').transform(lambda x: (x-np.mean(x))/np.std(x, ddof=1))
value
0 -0.581303
1 -0.573389
2 1.154691
3 -0.707107
4 0.707107
This matches the R result.
Second, if any of your groups (by advertiser_id) happens to contain just one item, its sample standard deviation is not defined and you will get NaN; check whether that is where your NaNs come from. R would return NaN in this case as well.
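To get the desired scaled_value column directly, a sketch that keeps ddof=1 by relying on pandas' Series.std() default (which matches R's sd()), assuming the dataframe and column names from the example above:
def scale(x):
    # pandas' std() uses ddof=1 by default, matching R
    return (x - x.mean()) / x.std()

df['scaled_value'] = df.groupby('advertiser_id')['value'].transform(scale)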
