This question follows up on my question about linear interpolation between two data points.
I built the following function from it:
import numpy as np
import pandas as pd

def inter(colA, colB):
    # build a series indexed by the x-values and interpolate the missing middle point
    s = pd.Series([colA, np.nan, colB], index=[95, 100, 102.5])
    s = s.interpolate(method='index')
    return s.iloc[1]
Now I have a data frame that looks like this:
       on95  on102.5  on105
Index
1         5       17     20
2         7       15     25
3         6       16     23
I would like to create a new column df['new'] that uses the function inter with the on95 and on102.5 columns as inputs.
I tried it like this:
df['new'] = inter(df['on95'], df['on102.5'])
but this resulted in NaNs.
I also tried with apply(inter) but did not find a way to make it work without an error message.
Any hints how I can solve this?
You need to vectorize your user-defined function with np.vectorize, because as written it receives whole pandas Series as its parameters:
inter = np.vectorize(inter)
df['new'] = inter(df['on95'], df['on102.5'])
df
#        on95  on102.5  on105        new
# Index
# 1         5       17     20  13.000000
# 2         7       15     25  12.333333
# 3         6       16     23  12.666667
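Since the interpolation point (100) always sits between the same two x-values (95 and 102.5), a possible alternative, sketched here under that assumption, is to do the linear interpolation arithmetically on whole columns at once, avoiding the per-row Python calls that np.vectorize still makes internally:
# linear interpolation at x=100 between (95, on95) and (102.5, on102.5), column-wise
weight = (100 - 95) / (102.5 - 95)
df['new'] = df['on95'] + (df['on102.5'] - df['on95']) * weight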
I suppose that someone might have asked this already, but for the life of me I cannot find what I need after some looking, possibly my level of Py is too low.
I saw several questions with answers using globals() and exec() with comments that it's a bad idea, other answers suggest using dictionaries or lists. At this point I got a bit loopy about what to use here and any help would be very welcome.
What I need is roughly this:
I have a Python DataFrame, say called dftest
I'd like to split dftest into say 6 parts of similar size
then I'd like to iterate over them (or possibly parallelise?) and run some steps calling some spatial functions that use parameters (param0, param1, ..., param5) over each of the rows of each df to add more columns, preferably exporting each result to a CSV (as it takes a long time to complete one part, I wouldn't want to lose the result of an iteration)
And then I'd like to put them back together into one DataFrame, say dfresult (possibly with concat) and continue doing the next thing with it
To keep it simple, this is what a toy dftest looks like (the original df has more rows and columns):
print(dftest)
# rowid type lon year
# 1 1 Tomt NaN 2021
# 2 2 Lägenhet 12.72 2022
# 3 3 Lägenhet NaN 2017
# 4 4 Villa 17.95 2016
# 5 5 Radhus 17.95 2021
# 6 6 Villa 17.95 2016
# 7 7 Fritidshus 18.64 2020
# 8 8 Villa 18.64 2019
# 9 9 Villa 18.63 2021
# 10 10 Villa 18.63 2019
# 11 11 Villa 17.66 2017
# 12 12 Radhus 17.66 2022
So here is what I tried:
dfs = np.array_split(dftest, 6)
for j in range(0, 6):
    print(f'dfs[{j}] has', len(dfs[j].index), 'obs ', min(dfs[j].index), 'to ', max(dfs[j].index))
where I get output:
# dfs[0] has 2 obs 1 to 2
# dfs[1] has 2 obs 3 to 4
# dfs[2] has 2 obs 5 to 6
# dfs[3] has 2 obs 7 to 8
# dfs[4] has 2 obs 9 to 10
# dfs[5] has 2 obs 11 to 12
So now I'd like to iterate over each df and create more columns. I tried a hardcoded test, one part at a time, with something like this:
for row in tqdm(dfs[0].itertuples()):
    x = row.type
    y = foo.bar(x, param="param0")
    i = row[0]  # index value of the current row
    dfs[0].loc[i, 'anotherColumn'] = baz(y)
    # ... some more functions ...
dfs[0].to_csv("/projectPath/dfs0.csv")
I suppose this should be possible to automate or even run in parallel (how?)
And in the end I'll try putting them together (no clue if this would work), possibly something like this:
pd.concat([dfs[0],dfs[1],dfs[2],dfs[3],dfs[4],dfs[5]])
If I had 100 parts, perhaps something like dfs[0]:dfs[5] would work... but I'm still stuck on the previous step.
PS. I'm using a Jupyter notebook on localhost with Python 3.
As far as I understand, you can use the chunk_apply function of the parallel-pandas library. This function splits the DataFrame into chunks, applies a custom function to each chunk, and then concatenates the results. Everything runs in parallel. Toy example:
# pip install parallel-pandas
import pandas as pd
import numpy as np
from parallel_pandas import ParallelPandas

# initialize parallel-pandas
# n_cpu - count of cores and split chunks
ParallelPandas.initialize(n_cpu=8)

def foo(df):
    # do something with df
    df['new_col'] = df.sum(axis=1)
    return df

if __name__ == '__main__':
    ROW = 10000
    COL = 10
    df = pd.DataFrame(np.random.random((ROW, COL)))
    res = df.chunk_apply(foo, axis=0)
    print(res.head())
Out:
0 1 2 ... 8 9 new_col
0 0.735248 0.393912 0.966608 ... 0.261675 0.207216 6.276589
1 0.256962 0.461601 0.341175 ... 0.688134 0.607418 5.297881
2 0.335974 0.093897 0.622115 ... 0.442783 0.115127 3.102827
3 0.488585 0.709927 0.209429 ... 0.942065 0.126600 4.367873
4 0.619996 0.704085 0.685806 ... 0.626539 0.145320 4.901926
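If you prefer to stay with the standard library, here is a rough sketch of the same split → process → export → concatenate workflow using np.array_split and concurrent.futures. The process_chunk function, the params list, and the CSV paths are placeholders standing in for your own spatial functions and param0...param5:
import numpy as np
import pandas as pd
from concurrent.futures import ProcessPoolExecutor

def process_chunk(args):
    j, chunk, param = args
    # placeholder: call your own spatial functions here to add more columns
    chunk = chunk.copy()
    chunk['anotherColumn'] = chunk['type'].astype(str) + '_' + str(param)
    chunk.to_csv(f"dfs{j}.csv")  # keep each partial result on disk
    return chunk

if __name__ == '__main__':
    params = ['param0', 'param1', 'param2', 'param3', 'param4', 'param5']
    dfs = np.array_split(dftest, 6)  # assumes dftest already exists
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(process_chunk, zip(range(6), dfs, params)))
    dfresult = pd.concat(results)  # works for any number of parts
Note that process-based parallelism from inside a notebook may require the worker function to live in a separate module (depending on the OS); a ThreadPoolExecutor is a drop-in alternative if the spatial functions are I/O bound.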
I have the following apply function:
week_to_be_predicted = 15
df['raw_data'] = df.apply(lambda x: get_raw_data(x['boxscore']) if x['week_int']<week_to_be_predicted else 0,axis=1)
df['raw_data'] = df.apply(lambda x: get_initials(x['boxscore']) if x['week_int']==week_to_be_predicted else x['raw_data'],axis=1)
Where df['week_int'] is a column of integers starting from 0 and increasing to 18. If the row value for df['week_int'] < week_to_be_predicted (in this case 15) I want the function get_raw_data to be applied, otherwise I want the function get_initials to be applied.
My question is in regards to troubleshooting the apply() function. The reason is that after successfully applying the get_raw_data to all rows where week_int < 14, instead of putting 0's for the remaining rows of df['raw_data'] (else 0), the "loop" restarts, and it begins from the first row of the dataframe and starts applying the get_raw_data all over again, seemingly stuck in an infinite loop.
What's more confounding is that it does not always do this. The functions as written initially solved this same problem and have been working as intended for the past ~10 weeks, but now, all of a sudden, when I set week_to_be_predicted to 15, it is reverting to its old ways.
I'm wondering if this has something to do with the apply() function, the conditions inside the apply function, or both. It's difficult for me to troubleshoot, as the logic has worked in the past. I'm wondering if there is something about apply() that makes this a less than optimal approach, and if anybody knows what aspect might be causing the problem.
Thank you in advance.
Use a boolean mask:
import numpy as np
import pandas as pd

# dummy versions of the real functions
def get_raw_data(sr):
    return -sr

def get_initials(sr):
    return sr

week_to_be_predicted = 15
df = pd.DataFrame({'week_int': np.arange(0, 19),
                   'boxscore': np.random.random(19)})

m = df['week_int'] < week_to_be_predicted
df.loc[m, 'raw_data'] = get_raw_data(df.loc[m, 'boxscore'])
df.loc[~m, 'raw_data'] = get_initials(df.loc[~m, 'boxscore'])
Output:
>>> df
week_int boxscore raw_data
0 0 0.232081 -0.232081
1 1 0.890318 -0.890318
2 2 0.372760 -0.372760
3 3 0.697202 -0.697202
4 4 0.400200 -0.400200
5 5 0.793784 -0.793784
6 6 0.783359 -0.783359
7 7 0.898331 -0.898331
8 8 0.440433 -0.440433
9 9 0.415760 -0.415760
10 10 0.599502 -0.599502
11 11 0.941613 -0.941613
12 12 0.039865 -0.039865
13 13 0.820617 -0.820617
14 14 0.471396 -0.471396
15 15 0.794547 0.794547
16 16 0.682332 0.682332
17 17 0.638694 0.638694
18 18 0.761995 0.761995
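If both functions are vectorized (they accept and return whole Series, as the dummy versions above do), the same result can be written in one step with numpy.where; this is a sketch, not part of the original answer:
# choose element-wise between the two vectorized results, based on the mask
df['raw_data'] = np.where(m, get_raw_data(df['boxscore']), get_initials(df['boxscore']))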
I'm new to the Pandas world and it has been hard to stop thinking sequentially.
I have a Series like:
df['sensor'].head(30)
0 6.8855
1 6.8855
2 6.8875
3 6.8885
4 6.8885
5 6.8895
6 6.8895
7 6.8895
8 6.8905
9 6.8905
10 6.8915
11 6.8925
12 6.8925
13 6.8925
14 6.8925
15 6.8925
16 6.8925
17 6.8925
Name: int_price, dtype: float64
I want to calculate the polynomial fit of the first value against each of the others in order to find an average. I defined a function to do the calculation and I want it to be applied to the series.
The function:
def linear_trend(a, b):
    return np.polyfit([1, 2], [a, b], 1)
The application:
a = pd.Series(df_plot['sensor'].iloc[0] for x in range(len(df_plot.index)))
df['ref'] = df_plot['sensor'].apply(lambda df_plot: linear_trend(a,df['sensor']))
This returns TypeError: No loop matching the specified signature and casting was found for ufunc lstsq_m.
or this:
a = df_plot['sensor'].iloc[0]
df['ref'] = df_plot['sensor'].apply(lambda df_plot: linear_trend(a,df['sensor']))
That returns ValueError: setting an array element with a sequence.
How can I solve this?
I was able to work around my issue by doing the following:
a = pd.Series(data=(df_plot['sensor'].iloc[0] for x in range(len(df_plot.index))), name='sensor_ref')
df_poly = pd.concat([a,df_plot['sensor']],axis=1)
df_plot['slope'] = df_poly[['sensor_ref','sensor']].apply(lambda df_poly: linear_trend(df_poly['sensor_ref'],df_poly['sensor']), axis=1)
If you have a better method, it's welcome.
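One possible simplification, assuming you really only ever fit a straight line through the two points (1, a) and (2, b): np.polyfit then returns (slope, intercept) = (b - a, 2a - b), so the slope column can be computed fully vectorized, without apply:
# slope of the line through (1, first value) and (2, current value) is just the difference
ref = df_plot['sensor'].iloc[0]
df_plot['slope'] = df_plot['sensor'] - ref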
Consider the values below:
array1 = np.array([526.59, 528.88, 536.19, 536.18, 536.18, 534.14, 538.14, 535.44,532.21, 531.94, 531.89, 531.89, 531.23, 529.41, 526.31, 523.67])
I convert these into a pandas Series object
import numpy as np
import pandas as pd
df = pd.Series(array1)
And compute the percentage change as
df = (1+df.pct_change(periods=1))
From here, how do I construct an index (base=100)? My desired output should be:
0 100.00
1 100.43
2 101.82
3 101.82
4 101.82
5 101.43
6 102.19
7 101.68
8 101.07
9 101.02
10 101.01
11 101.01
12 100.88
13 100.54
14 99.95
15 99.45
I can achieve the objective through an iterative (loop) solution, but that may not be practical if the data depth and breadth are large. Secondly, is there a way I can get this done in a single step on multiple columns? Thank you all for any guidance.
An index (base=100) is the relative change of a series in relation to its first element. So there's no need to take a detour through relative changes and recalculate the index from them when you can get it directly by
df = pd.Series(array1)/array1[0]*100
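For the multi-column case mentioned in the question, the same idea extends directly: dividing a DataFrame by its first row rebases every column at once. A small sketch with made-up column names 'a' and 'b':
import numpy as np
import pandas as pd

array1 = np.array([526.59, 528.88, 536.19, 536.18, 536.18, 534.14, 538.14, 535.44,
                   532.21, 531.94, 531.89, 531.89, 531.23, 529.41, 526.31, 523.67])
df = pd.DataFrame({'a': array1, 'b': array1 * 2})  # 'a' and 'b' are hypothetical columns
rebased = df.div(df.iloc[0]).mul(100)              # every column indexed to 100 at row 0
print(rebased.head())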
As far as I know, there is still no off-the-shelf expanding_window version for pct_change(). You can avoid the for-loop by using apply:
# generate data
import pandas as pd

series = pd.Series([526.59, 528.88, 536.19, 536.18, 536.18, 534.14, 538.14, 535.44,
                    532.21, 531.94, 531.89, 531.89, 531.23, 529.41, 526.31, 523.67])

# compute percentage change with respect to the first value
series.apply(lambda x: ((x / series.iloc[0]) - 1) * 100) + 100
Output:
0 100.000000
1 100.434873
2 101.823050
3 101.821151
4 101.821151
5 101.433753
6 102.193357
7 101.680624
8 101.067244
9 101.015971
10 101.006476
11 101.006476
12 100.881141
13 100.535521
14 99.946828
15 99.445489
dtype: float64
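If you do want to start from the (1 + pct_change()) series you already computed, a cumulative product also rebases it to 100, and the same expression works column-wise on a DataFrame; a small sketch:
import numpy as np
import pandas as pd

array1 = np.array([526.59, 528.88, 536.19, 536.18, 536.18, 534.14, 538.14, 535.44,
                   532.21, 531.94, 531.89, 531.89, 531.23, 529.41, 526.31, 523.67])
s = pd.Series(array1)
idx = (1 + s.pct_change()).fillna(1).cumprod() * 100  # NaN in the first slot becomes the 100 base
print(idx.round(2))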
I have 2 time series.
df=pd.DataFrame([
['1/10/12',10],
['1/11/12',11],
['1/12/12',13],
['1/14/12',12],
],
columns=['Time','n'])
df.index=pd.to_datetime(df['Time'])
df1=pd.DataFrame([
['1/13/12',88],
],columns=['Time','n']
)
df1.index=pd.to_datetime(df1['Time'])
I am trying to align the time series so the index is in order. I am guessing reindex_like is what I need but not sure how to use it.
Here is my desired output
Time n
0 1/10/12 10
1 1/11/12 11
2 1/12/12 13
3 1/13/12 88
4 1/14/12 12
Here is what you need:
df.append(df1).sort().reset_index(drop=True)
Note that in current pandas versions DataFrame.append and DataFrame.sort have been removed; the equivalent there is pd.concat([df, df1]).sort_index().reset_index(drop=True).
If you need to compile more pieces together, it is more efficient to use pd.concat(<names of all your dataframes as a list>).
P.S. Your code is a bit redundant: you don't need to cast Time into the index if you don't need it there. You can sort values based on any column, like this:
import pandas as pd
df=pd.DataFrame([
['1/10/12',10],
['1/11/12',11],
['1/12/12',13],
['1/14/12',12],
],
columns=['Time','n'])
df1=pd.DataFrame([
['1/13/12',88],
],columns=['Time','n']
)
pd.concat([df, df1]).sort_values('Time')
You can use concat, sort_index and reset_index:
df = pd.concat([df,df1]).sort_index().reset_index(drop=True)
print(df)
Time n
0 1/10/12 10
1 1/11/12 11
2 1/12/12 13
3 1/13/12 88
4 1/14/12 12
Or you can use merge_ordered (called ordered_merge in older pandas versions):
print(pd.merge_ordered(df, df1))
Time n
0 1/10/12 10
1 1/11/12 11
2 1/12/12 13
3 1/13/12 88
4 1/14/12 12