Python curve fitting on a pandas dataframe, then adding the coefficients to new columns

I have a dataframe that needs a second-order polynomial curve fit per row.
There are four columns; each column name denotes an x value.
Each row contains the 4 y values corresponding to the x values in the column names.
For example:
Based on the code below, the fitting for the first row will take x = [2, 5, 8, 12] and y = [5.91, 28.06, 67.07, 145.20].
import numpy as np
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
                   'id2': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
                   'x': [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12],
                   'y': [5.91, 4.43, 5.22, 1.31, 4.42, 3.65, 4.45, 1.70, 3.94, 3.29, 28.06, 19.51, 23.30, 4.20, 18.61, 17.60, 18.27, 16.18, 16.81, 16.37, 67.07, 46.00, 54.95, 43.66, 42.70, 41.32, 12.69, 36.75, 41.36, 38.66, 145.20, 118.34, 16.74, 94.10, 93.45, 86.60, 26.17, 77.12, 91.42, 83.11]})
pivot_df = df.pivot_table(index=['id', 'id2'], columns=['x'])
Output:
>>> pivot_df
            y
x           2      5      8      12
id id2
1  A     5.91  28.06  67.07  145.20
   B     3.65  17.60  41.32   86.60
2  A     4.43  19.51  46.00  118.34
   B     4.45  18.27  12.69   26.17
3  A     5.22  23.30  54.95   16.74
   B     1.70  16.18  36.75   77.12
4  A     1.31   4.20  43.66   94.10
   B     3.94  16.81  41.36   91.42
5  A     4.42  16.37  42.70   93.45
   B     3.29  18.61  38.66   83.11
I want to perform the curve fitting without explicitly iterating over the rows, in order to take advantage of the fast under-the-hood iteration built into pandas dataframes. I am not sure how to do so.
I wrote the code to loop through the rows anyway, to show the desired output. Although the code below works and produces the desired output, I need help making it more concise/efficient.
my_coef_array = np.zeros(3)
# get the x values from the column names
x = pivot_df.columns.get_level_values(pivot_df.columns.names.index('x')).values
for index in pivot_df.index:
    my_coef_array = np.vstack((my_coef_array, np.polyfit(x, pivot_df.loc[index].values, 2)))
my_coef_array = my_coef_array[1:, :]  # drop the initial row of zeros
pivot_df['m2'] = my_coef_array[:, 0]
pivot_df['m1'] = my_coef_array[:, 1]
pivot_df['c'] = my_coef_array[:, 2]
Output:
>>> pivot_df
            y                               m2         m1          c
x           2      5      8      12
id id2
1  A     5.91  28.06  67.07  145.20  0.934379   0.848422   0.471170
   B     3.65  17.60  41.32   86.60  0.510664   1.156009  -0.767408
2  A     4.43  19.51  46.00  118.34  1.034594  -3.221912   7.518221
   B     4.45  18.27  12.69   26.17 -0.015300   2.045216   2.496306
3  A     5.22  23.30  54.95   16.74 -1.356997  20.827407 -35.130416
   B     1.70  16.18  36.75   77.12  0.410485   1.772052  -3.345097
4  A     1.31   4.20  43.66   94.10  0.803630  -1.577705  -1.148066
   B     3.94  16.81  41.36   91.42  0.631377  -0.085651   1.551586
5  A     4.42  16.37  42.70   93.45  0.659044  -0.278738   2.068114
   B     3.29  18.61  38.66   83.11  0.478171   1.218486  -0.638888

I found numpy.polynomial.polynomial.polyfit, an alternative to np.polyfit that accepts a 2-D array for y.
Starting from the x in your code, I get the following:
my_coef_array = pd.DataFrame(np.polynomial.polynomial.polyfit(x, pivot_df.T.values, 2)).T
my_coef_array.index = pivot_df.index
my_coef_array.columns = ['c', 'm1', 'm2']
pivot_df = pivot_df.join(my_coef_array)
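Note that numpy.polynomial.polynomial.polyfit returns coefficients in increasing degree order (constant term first), the reverse of np.polyfit; that is why the columns are labeled ['c', 'm1', 'm2'] rather than ['m2', 'm1', 'c']. As a quick sanity check (a sketch using the first row's x and y values from the question):
# np.polyfit returns [m2, m1, c], highest degree first
check = np.polyfit(x, [5.91, 28.06, 67.07, 145.20], 2)
# reversed, it should match the first row of my_coef_array ([c, m1, m2])
print(check[::-1])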

Related

Find highest two numbers on every row in pandas dataframe and extract the column names

I have a dataframe with multiple columns and I would like to add two more: one for the highest number in the row, and another for the second highest. However, instead of the numbers, I would like to show the column names where they are found.
Assume the following data frame:
import pandas as pd
df = pd.DataFrame({'A': [1, 5, 10], 'B': [2, 6, 11], 'C': [3, 7, 12], 'D': [4, 8, 13], 'E': [5, 9, 14]})
To extract the highest number on every row, I can just apply max(axis=1) like this:
df['max1'] = df[['A', 'B', 'C', 'D', 'E']].max(axis = 1)
This gets me the max number, but not the column name itself.
How can this be applied to the second max number as well?
You can sort the values and assign the top-2 values:
import numpy as np
cols = ['A', 'B', 'C', 'D', 'E']
df[['max2','max1']] = np.sort(df[cols].to_numpy(), axis=1)[:, -2:]
print (df)
    A   B   C   D   E  max2  max1
0   1   2   3   4   5     4     5
1   5   6   7   8   9     8     9
2  10  11  12  13  14    13    14
If you need the columns in max1, max2 order instead, reverse the two sorted columns:
df[['max1','max2']] = np.sort(df[cols].to_numpy(), axis=1)[:, -2:][:, ::-1]
EDIT: To get the top-2 column names as well as the top-2 values, use:
df = pd.DataFrame({'A': [1, 50, 10], 'B': [2, 6, 11],
                   'C': [3, 7, 12], 'D': [40, 8, 13], 'E': [5, 9, 14]})
cols = ['A', 'B', 'C', 'D', 'E']
#values in numpy array
vals = df[cols].to_numpy()
#column names in array
cols = np.array(cols)
#get indices that would sort each row in descending order
arr = np.argsort(-vals, axis=1)
#top 2 column names
df[['top1','top2']] = cols[arr[:, :2]]
#top 2 values: pair each row index with its top-2 column indices
df[['max1','max2']] = vals[np.arange(arr.shape[0])[:, None], arr[:, :2]]
print (df)
    A   B   C   D   E top1 top2  max1  max2
0   1   2   3  40   5    D    E    40     5
1  50   6   7   8   9    A    E    50     9
2  10  11  12  13  14    E    D    14    13
Another approach: get the first max, then replace it and take the max again to get the second max (note this replaces the max values with 0 wherever they appear, so it assumes positive, non-repeating values):
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 15, 10], 'B': [2, 89, 11], 'C': [80, 7, 12], 'D': [4, 8, 13], 'E': [5, 9, 14]})
max1 = df.max(axis=1)
maxcolum1 = df.idxmax(axis=1)
max2 = df.replace(np.array(df.max(axis=1)), 0).max(axis=1)
maxcolum2 = df.replace(np.array(df.max(axis=1)), 0).idxmax(axis=1)
df2 = pd.DataFrame({'max1': max1, 'max2': max2, 'maxcol1': maxcolum1, 'maxcol2': maxcolum2})
df.join(df2)
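For a variant that is robust to duplicate and negative values, one could apply nlargest per row (a sketch; top2 is a hypothetical helper, and a row-wise apply is slower than the vectorized answers above):
def top2(row):
    # nlargest(2) returns the two largest values, labelled by column name
    s = row.nlargest(2)
    return pd.Series({'max1': s.iloc[0], 'maxcol1': s.index[0],
                      'max2': s.iloc[1], 'maxcol2': s.index[1]})

df.join(df.apply(top2, axis=1))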

How to update original array with groupby in python

I have a dataset and I am trying to iterate over each group and, based on the group, update the original dataframe:
import pandas as pd
import numpy as np
arr = np.array([1, 2, 4, 7, 11, 16, 22, 29, 37, 46])
df = pd.DataFrame({'grain': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']})
df["target"] = arr
for group_name, b in df.groupby("grain"):
    if group_name == "A":
        ...  # do some processing
    if group_name == "B":
        ...  # do another processing
I expect the original df to be updated. Is there any way to do it?
Here is a way to change the original data; this example requires a non-duplicated index. I am not sure what the benefit of this approach would be compared to using classical pandas operations.
import pandas as pd
import numpy as np
arr = np.array([1, 2, 4, 7, 11, 16, 22, 29, 37, 46])
df = pd.DataFrame({'grain': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']})
df["target"] = arr
for g_name, g_df in df.groupby("grain"):
    if g_name == "A":
        df.loc[g_df.index, 'target'] *= 10
    if g_name == "B":
        df.loc[g_df.index, 'target'] *= -1
Output:
>>> df
  grain  target
0     A      10
1     B      -2
2     A      40
3     B      -7
4     A     110
5     B     -16
6     A     220
7     B     -29
8     A     370
9     B     -46
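For this particular kind of update there is also a vectorized alternative that avoids the loop entirely (a sketch using np.where, assuming the same processing as above: multiply group A by 10 and negate group B):
df['target'] = np.where(df['grain'] == 'A', df['target'] * 10, df['target'] * -1)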

Python appending a list to dataframe column

I have a dataframe from a Stata file and I would like to add a new column to it which has a numeric list as the entry for each row. How can one accomplish this? I have been trying assignment, but it complains about index size.
I tried initializing a new column of strings (also tried integers) and tried something like this, but it didn't work.
testdf['new_col'] = '0'
testdf['new_col'] = testdf['new_col'].map(lambda x : list(range(100)))
Here is a toy example resembling what I have:
import pandas as pd
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd'], 'start_val': [1, 7, 9, 10], 'end_val': [3, 11, 12, 15]}
testdf = pd.DataFrame.from_dict(data)
This is what I would like to have:
data2 = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd'], 'start_val': [1,7,9,10], 'end_val' : [3,11, 12,15], 'list' : [[1,2,3],[7,8,9,10,11],[9,10,11,12],[10,11,12,13,14,15]]}
testdf2 = pd.DataFrame.from_dict(data2)
My final goal is to use explode on that "list" column to duplicate the rows appropriately.
Try this bit of code:
import numpy as np
testdf['list'] = pd.Series(np.arange(i, j) for i, j in zip(testdf['start_val'],
                                                           testdf['end_val'] + 1))
testdf
Output:
   col_1 col_2  start_val  end_val                      list
0      3     a          1        3                 [1, 2, 3]
1      2     b          7       11         [7, 8, 9, 10, 11]
2      1     c          9       12           [9, 10, 11, 12]
3      0     d         10       15  [10, 11, 12, 13, 14, 15]
We use a generator expression with zip, the pd.Series constructor, and np.arange to create the lists.
If you'd rather stick to using the apply function:
import pandas as pd
import numpy as np
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd'], 'start_val': [1,7,9,10], 'end_val' : [3,11, 12,15]}
df = pd.DataFrame.from_dict(data)
df['range'] = df.apply(lambda row: np.arange(row['start_val'], row['end_val']+1), axis=1)
print(df)
Output:
   col_1 col_2  start_val  end_val                     range
0      3     a          1        3                 [1, 2, 3]
1      2     b          7       11         [7, 8, 9, 10, 11]
2      1     c          9       12           [9, 10, 11, 12]
3      0     d         10       15  [10, 11, 12, 13, 14, 15]
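Either way, the stated final goal can then be reached by expanding the list column into one row per element (a sketch; DataFrame.explode is available from pandas 0.25):
testdf = testdf.explode('list').reset_index(drop=True)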

dynamic shift with groupby on dataframe

I need to shift a grouped data frame by a dynamic number. I can do it with apply, but the performance is not very good.
Any way to do that without apply?
Here is a sample of what I would like to do:
import pandas as pd
df = pd.DataFrame({
    'GROUP': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
    'VALUE': [1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2],
    'SHIFT': [2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3]
})
df['SUM'] = df.groupby('GROUP').VALUE.cumsum()
# THIS DOESN'T WORK:
df['VALUE'] = df.groupby('GROUP').SUM.shift(df.SHIFT)
I currently do it with apply as follows:
df = pd.DataFrame({
    'GROUP': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
    'VALUE': [1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2],
    'SHIFT': [2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3]
})
def func(group):
    s = group.SHIFT.iloc[0]
    group['SUM'] = group.SUM.shift(s)
    return group
df['SUM'] = df.groupby('GROUP').VALUE.cumsum()
df = df.groupby('GROUP').apply(func)
Here is a pure numpy version that works if the data frame is sorted by group (like your example):
# these rows are not null after shifting
notnull = np.where(df.groupby('GROUP').cumcount() >= df['SHIFT'])[0]
# source rows for rows above
source = notnull - df['SHIFT'].values[notnull]
shifted = np.empty(df.shape[0])
shifted[:] = np.nan
shifted[notnull] = df.groupby('GROUP')['VALUE'].cumsum().values[source]
df['SUM'] = shifted
It first gets the indices of the rows that are to be updated; subtracting the shifts from those indices yields the source rows.
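Wrapped as a reusable helper, the same idea might look like this (a sketch; dynamic_shift is a hypothetical name, not a pandas API):
def dynamic_shift(df, group_col, value_col, shift_col):
    # position of each row within its group
    pos = df.groupby(group_col).cumcount()
    # rows far enough into their group to receive a shifted value
    notnull = np.where(pos >= df[shift_col])[0]
    # the rows those values are shifted from
    source = notnull - df[shift_col].values[notnull]
    out = np.full(len(df), np.nan)
    out[notnull] = df.groupby(group_col)[value_col].cumsum().values[source]
    return out

df['SUM'] = dynamic_shift(df, 'GROUP', 'VALUE', 'SHIFT')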
A solution that avoids apply could be the following, provided the groups are contiguous:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'GROUP': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
    'VALUE': [1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2],
    'SHIFT': [2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3]
})
# compute values required for the slices
_, start = np.unique(df.GROUP.values, return_index=True)
gp = df.groupby('GROUP')
shifts = gp.SHIFT.first()
sizes = gp.size().values
end = (sizes - shifts.values) + start
# compute slices
source = [i for s, f in zip(start, end) for i in range(s, f)]
target = [i for j, s, f in zip(start, shifts, sizes) for i in range(j + s, j + f)]
# compute cumulative sum and arrays of nan
s = gp.VALUE.cumsum().values
r = np.empty_like(s, dtype=np.float32)
r[:] = np.nan
# set the shifted sums on the array of nan
np.put(r, target, s[source])
# set the sum column
df['SUM'] = r
print(df)
Output:
   GROUP  SHIFT  VALUE   SUM
0      A      2      1   NaN
1      A      2      2   NaN
2      A      2      3   1.0
3      A      2      4   3.0
4      A      2      5   6.0
5      A      2      6  10.0
6      B      3      7   NaN
7      B      3      8   NaN
8      B      3      9   NaN
9      B      3      0   7.0
10     B      3      1  15.0
11     B      3      2  24.0
With the exception of building the slices (source and target), all computations are done at the pandas/numpy level, which should be fast. The idea is to manually simulate what would be done in the apply function.
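As a quick sanity check (a sketch), either result can be compared against the apply-based version from the question:
expected = df.groupby('GROUP', group_keys=False).apply(
    lambda g: g['VALUE'].cumsum().shift(g['SHIFT'].iloc[0]))
assert np.allclose(df['SUM'], expected, equal_nan=True)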

Reshape rows to columns in pandas dataframe

In pandas, how do I go from a:
a = pd.DataFrame({'foo': ['m', 'm', 'm', 's', 's', 's'],
                  'bar': [1, 2, 3, 4, 5, 6]})
>>> a
   bar foo
0    1   m
1    2   m
2    3   m
3    4   s
4    5   s
5    6   s
to b:
b = pd.DataFrame({'m': [1, 2, 3],
                  's': [4, 5, 6]})
>>> b
   m  s
0  1  4
1  2  5
2  3  6
I tried solutions in other answers, e.g. here and here, but none seemed to do what I want.
Basically, I want to swap rows with columns and drop the index, but how do I do it?
a.set_index(
    [a.groupby('foo').cumcount(), 'foo']
).bar.unstack()
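This produces:
foo  m  s
0    1  4
1    2  5
2    3  6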
This is my solution:
a = pd.DataFrame({'foo': ['m', 'm', 'm', 's', 's', 's'],
                  'bar': [1, 2, 3, 4, 5, 6]})
a.pivot(columns='foo', values='bar').apply(lambda x: pd.Series(x.dropna().values))
foo    m    s
0    1.0  4.0
1    2.0  5.0
2    3.0  6.0
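Note the float dtype in this result: pivot first lays the values out against the original index, leaving NaNs (each source row has a value in only one of the m/s columns), which forces floats before dropna compacts the columns. If integers are needed, an astype at the end restores them (a sketch):
a.pivot(columns='foo', values='bar').apply(lambda x: pd.Series(x.dropna().values)).astype(int)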
