Python piecewise linear interpolation across dataframes in a list

I am trying to apply piecewise linear interpolation. I first tried to use pandas' built-in interpolate function, but it was not working.
Example data looks like this:
import pandas as pd
import numpy as np
d = {'ID':[5,5,5,5,5,5,5], 'month':[0,3,6,9,12,15,18], 'num':[7,np.nan,5,np.nan,np.nan,5,8]}
tempo = pd.DataFrame(data = d)
d2 = {'ID':[6,6,6,6,6,6,6], 'month':[0,3,6,9,12,15,18], 'num':[5,np.nan,2,np.nan,np.nan,np.nan,7]}
tempo2 = pd.DataFrame(data = d2)
this = []
this.append(tempo)
this.append(tempo2)
The actual data has over 1000 unique IDs, so I filtered each ID into a dataframe and put them into the list.
I am trying to go through all the dataframes in the list and apply a piecewise linear interpolation to each. I tried setting month as the index and using .interpolate(method='index', inplace=True), but it was not working.
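Presumably the attempt looked something like the sketch below; note that interpolate(..., inplace=True) on the temporary frame returned by set_index would discard the result, which may be why it appeared not to work:
for i, df in enumerate(this):
    this[i] = (df.set_index('month')
                 .interpolate(method='index')  # piecewise linear over the month values
                 .reset_index()[df.columns])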
The expected output is
ID | month | num
5 | 0 | 7
5 | 3 | 6
5 | 6 | 5
5 | 9 | 5
5 | 12 | 5
5 | 15 | 5
5 | 18 | 8
This needs to be applied across all the dataframes in the list.

Assuming this is a follow-up to your previous question, change the code to:
for i, df in enumerate(this):
    this[i] = (df
        .set_index('month')
        # optional, because of the previous question
        .reindex(range(df['month'].min(), df['month'].max()+3, 3))
        .interpolate()
        .reset_index()[df.columns]
    )
NB. I simplified the code by removing the groupby; this only works because, as you mentioned in the other question, there is a single group per DataFrame. A variant that restores the groupby is sketched after the output below.
Output:
[   ID  month  num
 0   5      0  7.0
 1   5      3  6.0
 2   5      6  5.0
 3   5      9  5.0
 4   5     12  5.0
 5   5     15  5.0
 6   5     18  8.0,
    ID  month   num
 0   6      0  5.00
 1   6      3  3.50
 2   6      6  2.00
 3   6      9  3.25
 4   6     12  4.50
 5   6     15  5.75
 6   6     18  7.00]
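If a DataFrame can contain more than one ID, a sketch that restores the groupby (per the other question) and interpolates within each group; the interpolate step is the only addition on top of the groupby version shown in the next answer:
for i, df in enumerate(this):
    this[i] = (df
        .set_index('month')
        .groupby('ID')
        .apply(lambda x: x.drop(columns='ID')
                          .reindex(range(x.index.min(), x.index.max()+3, 3))
                          .interpolate())
        .reset_index()[df.columns]
    )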


Adding rows in dataframe with a condition in list

I am trying to add rows under a condition but am having difficulty achieving this.
Currently, I have pandas dataframes in a list that look like the following.
The objective is to add rows with a fixed number for 'ID', increasing the month by 3.
For example, for this[1] I want it to add rows so that it looks like the following:
ID | month | num
6 | 0 | 5
6 | 3 | NaN
6 | 6 | 4
6 | 9 | NaN
6 | 12 | 3
...
6 | 36 | 1
I am trying to create a function that takes the index of the list (so it would be an actual dataframe), the max month of that dataframe, and the increment I want the month to grow by (3), which would look like:
def add_rows(df, max_mon, res):
    if max_mon > res:
        # add rows with fixed ID and NaN num
        # skip the months that already exist

final = []
for i in range(len(this)):
    final.append(add_rows(this[i], this[i]['month'].max(), 3))
I have tried to insert rows but did not manage to get it to work.
The toy data:
d = {'ID':[5,5,5,5,5], 'month':[0,6,12,24,36], 'num':[5,4,3,2,1]}
tempo = pd.DataFrame(data = d)
d2 = {'ID':[6,6,6,6,6], 'month':[0,6,12,18,36], 'num':[5,4,3,2,1]}
tempo2 = pd.DataFrame(data = d2)
this = []
this.append(tempo)
this.append(tempo2)
I would really appreciate if I could get help on building the function!
You can use:
for i, df in enumerate(this):
    this[i] = (df
        .set_index('month')
        .groupby('ID')
        .apply(lambda x: x.drop(columns='ID')
                          .reindex(range(x.index.min(), x.index.max()+3, 3))
        )
        .reset_index()[df.columns]
    )
Updated this:
[    ID  month  num
 0    5      0  5.0
 1    5      3  NaN
 2    5      6  4.0
 3    5      9  NaN
 4    5     12  3.0
 5    5     15  NaN
 6    5     18  NaN
 7    5     21  NaN
 8    5     24  2.0
 9    5     27  NaN
 10   5     30  NaN
 11   5     33  NaN
 12   5     36  1.0,
     ID  month  num
 0    6      0  5.0
 1    6      3  NaN
 2    6      6  4.0
 3    6      9  NaN
 4    6     12  3.0
 5    6     15  NaN
 6    6     18  2.0
 7    6     21  NaN
 8    6     24  NaN
 9    6     27  NaN
 10   6     30  NaN
 11   6     33  NaN
 12   6     36  1.0]
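For reference, the same reindexing can be wrapped as the add_rows helper the question asks for; this is a minimal sketch assuming each DataFrame holds a single ID, and the max_mon/res arguments become unnecessary because the bounds can be read off the frame itself:
def add_rows(df, step=3):
    # insert the missing months at a fixed step, keeping 'ID' constant
    # and leaving 'num' as NaN for the new rows
    out = (df.set_index('month')
             .reindex(range(df['month'].min(), df['month'].max() + step, step))
             .reset_index()[df.columns])
    out['ID'] = df['ID'].iloc[0]  # restore the fixed ID on the inserted rows
    return out

final = [add_rows(df) for df in this]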

I try to map values to a column in pandas but I get NaN values instead

This is my code:
mapping = {"ISTJ":1, "ISTP":2, "ISFJ":3, "ISFP":4, "INFP":6, "INTJ":7, "INTP":8, "ESTP":9, "ESTJ":10, "ESFP":11, "ESFJ":12, "ENFP":13, "ENFJ":14, "ENTP":15, "ENTJ":16, "NaN": 17}
q20 = castaway_details[["personality_type"]]
q20["personality_type"] = q20["personality_type"].map(mapping)
The data frame looks like this:
personality_type
0 INTP
1 INFP
2 INTJ
3 ISTJ
4 NAN
5 ESFP
I want the output like this:
personality_type
0 8
1 6
2 7
3 1
4 17
5 11
However, what I get from my code is all NaNs.
Try pandas.Series.str.strip before the pandas.Series.map:
q20["personality_type"]= q20["personality_type"].str.strip().map(mapping)
# Output :
print(q20)
personality_type
0 8
1 6
2 7
3 1
4 17
5 11
The string key "NaN" in your mapping dictionary and the actual NaN values in your data frame do not match. Fill the missing values first (I have modified the key in your dictionary to match):
df.apply(lambda x: x.fillna('NAN').map(mapping))
personality_type
0 8
1 6
2 7
3 1
4 17
5 11
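Both problems can occur at once (stray whitespace around the strings and real NaN values), so here is a hedged sketch combining the two answers; the 'NAN' placeholder is an assumption and must match a key in the dictionary:
mapping["NAN"] = 17  # the key must match the filled placeholder exactly
q20["personality_type"] = (q20["personality_type"]
                           .str.strip()    # drop stray whitespace
                           .fillna("NAN")  # turn real NaN into the placeholder
                           .map(mapping))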

Ordering a dataframe by each column

I have a dataframe that looks like this:
ID Age Score
0 9 5 3
1 4 6 1
2 9 7 2
3 3 2 1
4 12 1 15
5 2 25 6
6 9 5 4
7 9 5 61
8 4 2 12
I want to sort based on the first column, then the second column, and so on.
So I want my output to be this:
ID Age Score
5 2 25 6
3 3 2 1
8 4 2 12
1 4 6 1
0 9 5 3
6 9 5 4
7 9 5 61
2 9 7 2
4 12 1 15
I know I can do the above with df.sort_values(df.columns.to_list()); however, I'm worried this might be quite slow for much larger dataframes (in terms of both columns and rows).
Is there a more optimal solution?
You can use numpy.lexsort to improve performance.
import numpy as np
a = df.to_numpy()
order = np.lexsort(np.rot90(a))
# reorder the index labels as well, matching sort_values' output
out = pd.DataFrame(a[order], index=df.index[order], columns=df.columns)
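Why this works: np.lexsort treats the last row of its key array as the primary sort key, and np.rot90 rotates the table so that the first column lands in that last row. A small check on a hedged toy example:
a = np.array([[9, 5, 3],
              [4, 6, 1]])
print(np.rot90(a))
# [[3 1]
#  [5 6]
#  [9 4]]   <- the last row is the first column, i.e. the primary key
print(np.lexsort(np.rot90(a)))  # [1 0]: the row starting with 4 sorts first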
Assuming as input a random square DataFrame of side n:
df = pd.DataFrame(np.random.randint(0, 100, size=(n, n)))
here is the comparison for 100 to 100M items (lower runtime is better). A second graph shows the speed relative to pandas. (The benchmark graphs are not reproduced here.)
By still using df.sort_values() you can speed it up a bit by selecting the type of sorting algorithm. By default it's set to quicksort, but there are the alternatives 'mergesort', 'heapsort' and 'stable'.
Maybe specifying one of these would improve it?
df.sort_values(df.columns.to_list(), kind="mergesort")
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html
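Since the benchmark graphs are missing above, here is a rough, hedged way to compare the two approaches yourself (n is an arbitrary size):
import timeit
import numpy as np
import pandas as pd

n = 1000
df = pd.DataFrame(np.random.randint(0, 100, size=(n, n)))

t_pandas = timeit.timeit(lambda: df.sort_values(df.columns.to_list()), number=5)
t_lexsort = timeit.timeit(lambda: df.to_numpy()[np.lexsort(np.rot90(df.to_numpy()))], number=5)
print(f"sort_values: {t_pandas:.3f}s, lexsort: {t_lexsort:.3f}s")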

Python pandas: creating a discrete series from a cumulative

I have a data frame where there are several groups of numeric series where the values are cumulative. Consider the following:
df = pd.DataFrame({'Cat': ['A', 'A','A','A', 'B','B','B','B'], 'Indicator': [1,2,3,4,1,2,3,4], 'Cumulative1': [1,3,6,7,2,4,6,9], 'Cumulative2': [1,3,4,6,1,5,7,12]})
In [74]:df
Out[74]:
Cat Cumulative1 Cumulative2 Indicator
0 A 1 1 1
1 A 3 3 2
2 A 6 4 3
3 A 7 6 4
4 B 2 1 1
5 B 4 5 2
6 B 6 7 3
7 B 9 12 4
I need to create discrete series for Cumulative1 and Cumulative2, with starting point being the earliest entry in 'Indicator'.
My approach is to use diff():
In[82]: df['Discrete1'] = df.groupby('Cat')['Cumulative1'].diff()
Out[82]: df
Cat Cumulative1 Cumulative2 Indicator Discrete1
0 A 1 1 1 NaN
1 A 3 3 2 2.0
2 A 6 4 3 3.0
3 A 7 6 4 1.0
4 B 2 1 1 NaN
5 B 4 5 2 2.0
6 B 6 7 3 2.0
7 B 9 12 4 3.0
I have 3 questions:
How do I avoid the NaN in an elegant/Pythonic way? The correct values are to be found in the original Cumulative series.
Secondly, how do I elegantly apply this computation to all series, say -
cols = ['Cumulative1', 'Cumulative2']
Thirdly, I have a lot of data that needs this computation -- is this the most efficient way?
You do not want to avoid NaNs, you want to fill them with the start values from the "cumulative" column:
df['Discrete1'] = df['Discrete1'].combine_first(df['Cumulative1'])
To apply the operation to all (or select) columns, broadcast it to all columns of interest:
sources = ['Cumulative1', 'Cumulative2']  # use a list; a tuple would be treated as a single column key
targets = ["Discrete" + x[len('Cumulative'):] for x in sources]
df[targets] = df.groupby('Cat')[sources].diff()
You still have to fill the NaNs in a loop:
for s, t in zip(sources, targets):
    df[t] = df[t].combine_first(df[s])
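If you prefer to avoid the explicit loop, DataFrame.fillna aligns on both index and columns, so here is a hedged one-shot variant of the same computation:
# diff per group, then fill each group's leading NaN with the original
# cumulative value; .to_numpy() sidesteps the column-name mismatch on assignment
sources = ['Cumulative1', 'Cumulative2']
targets = ['Discrete1', 'Discrete2']
df[targets] = (df.groupby('Cat')[sources]
                 .diff()
                 .fillna(df[sources])
                 .to_numpy())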

Pandas: Insert dataframe into other dataframe without preserving indices

I want to insert a pandas dataframe into another pandas dataframe at certain indices.
Lets say we have this dataframe:
original_df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
0 1 2
0 1 2 3
1 4 5 6
2 7 8 9
I can then change values at certain indices as following:
original_df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
original_df.iloc[[0,2],[0,1]] = 2
0 1 2
0 2 2 3
1 4 5 6
2 2 2 9
However, if I use the same technique to insert another dataframe, it doesn't work:
original_df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
df_to_insert = pd.DataFrame([[10,11],[12,13]])
original_df.iloc[[0,2],[0,1]] = df_to_insert
0 1 2
0 10.0 11.0 3.0
1 4.0 5.0 6.0
2 NaN NaN 9.0
I am looking for a way to get the following result:
0 1 2
0 10 11 3
1 4 5 6
2 12 13 9
It seems to me that with the syntax I am using, the values from df_to_insert are taken from the corresponding index at their target locations. Is there a way for me to avoid this?
When you do the insert, make sure to convert the df to values; pandas is index sensitive, which means it will always try to align on the index and columns during the assignment.
original_df.iloc[[0,2],[0,1]] = df_to_insert.values
original_df
Out[651]:
0 1 2
0 10 11 3
1 4 5 6
2 12 13 9
It does work with an array rather than a df:
original_df.iloc[[0,2],[0,1]] = np.array([[10,11],[12,13]])
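To see the alignment at work: df_to_insert carries row labels 0 and 1, while the assignment targets rows with labels 0 and 2, so label 2 finds no match and becomes NaN. A hedged alternative is to relabel the value so the alignment lands where you want:
import pandas as pd

original_df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
df_to_insert = pd.DataFrame([[10,11],[12,13]])

# relabel the rows to [0, 2] so pandas' label alignment places them correctly
original_df.iloc[[0, 2], [0, 1]] = df_to_insert.set_axis([0, 2])
print(original_df)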
