I'm having trouble understanding pandas reindex. I have a series of measurements, munged into a multi-index df, and I'd like to reindex and interpolate those measurements to align them with some other data.
My actual data has ~7 index levels and several different measurements. I hope the solution for this toy data problem is applicable to my real data. It's "small data"; each individual measurement is a couple KB.
Here's a pair of toy problems, one which shows the expected behavior and one which doesn't seem to do anything.
Single-level index, works as expected:
"""
step,value
1,1
3,2
5,1
"""
df_i = pd.read_clipboard(sep=",").set_index("step")
print(df_i)
new_index = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
df_i = df_i.reindex(new_index).interpolate()
print(df_i)
Outputs, the original df and the re-indexed and interpolated one:
value
step
1 1
3 2
5 1
value
step
1 1.0
2 1.5
3 2.0
4 1.5
5 1.0
6 1.0
7 1.0
8 1.0
9 1.0
Works great.
Multi-index, currently not working:
"""
sample,meas_id,step,value
1,1,1,1
1,1,3,2
1,1,5,1
1,2,3,2
1,2,5,2
1,2,7,1
1,2,9,0
"""
df_mi = pd.read_clipboard(sep=",").set_index(["sample", "meas_id", "step"])
print(df_mi)
df_mi = df_mi.reindex(new_index, level="step").interpolate()
print(df_mi)
Output, unchanged after reindex (and therefore after interpolate):
value
sample meas_id step
1 1 1 1
3 2
5 1
2 3 2
5 2
7 1
9 0
value
sample meas_id step
1 1 1 1
3 2
5 1
2 3 2
5 2
7 1
9 0
How do I actually reindex a column in a multi-index df?
Here's the output I'd like, assuming linear interpolation:
value
sample meas_id step
1 1 1 1
2 1.5
3 2
5 1
6 1
7 1
8 1
9 1
2 1 NaN (or 2)
2 NaN (or 2)
3 2
4 2
5 2
6 1.5
7 1
8 0.5
9 0
I spent some sincere time looking over SO, and if the answer is in there, I missed it:
Fill multi-index Pandas DataFrame with interpolation
Resampling Within a Pandas MultiIndex
pandas multiindex dataframe, ND interpolation for missing values
Fill multi-index Pandas DataFrame with interpolation
https://pandas.pydata.org/pandas-docs/stable/basics.html#basics-reindexing
Possibly related GitHub issues:
https://github.com/numpy/numpy/issues/11975
https://github.com/pandas-dev/pandas/issues/23104
https://github.com/pandas-dev/pandas/issues/17132
IIUC create the index by using MultiIndex.from_product, then just do reindex
idx=pd.MultiIndex.from_product([df_mi.index.levels[0],df_mi.index.levels[1],new_index])
df_mi.reindex(idx).interpolate()
Out[161]:
value
1 1 1 1.000000
2 1.500000
3 2.000000
4 1.500000
5 1.000000
6 1.142857
7 1.285714
8 1.428571
9 1.571429
2 1 1.714286 # here is bad , it take previous value into consideration
2 1.857143
3 2.000000
4 2.000000
5 2.000000
6 1.500000
7 1.000000
8 0.500000
9 0.000000
My think
def idx(x):
idx = pd.MultiIndex.from_product([x.index.get_level_values(0).unique(), x.index.get_level_values(1).unique(), new_index])
return idx
pd.concat([y.reindex(idx(y)).interpolate() for _,y in df_mi.groupby(level=[0,1])])
value
1 1 1 1.0
2 1.5
3 2.0
4 1.5
5 1.0
6 1.0
7 1.0
8 1.0
9 1.0
2 1 NaN
2 NaN
3 2.0
4 2.0
5 2.0
6 1.5
7 1.0
8 0.5
9 0.0
Related
I have a data frame where there are several groups of numeric series where the values are cumulative. Consider the following:
df = pd.DataFrame({'Cat': ['A', 'A','A','A', 'B','B','B','B'], 'Indicator': [1,2,3,4,1,2,3,4], 'Cumulative1': [1,3,6,7,2,4,6,9], 'Cumulative2': [1,3,4,6,1,5,7,12]})
In [74]:df
Out[74]:
Cat Cumulative1 Cumulative2 Indicator
0 A 1 1 1
1 A 3 3 2
2 A 6 4 3
3 A 7 6 4
4 B 2 1 1
5 B 4 5 2
6 B 6 7 3
7 B 9 12 4
I need to create discrete series for Cumulative1 and Cumulative2, with starting point being the earliest entry in 'Indicator'.
my Approach is to use diff()
In[82]: df['Discrete1'] = df.groupby('Cat')['Cumulative1'].diff()
Out[82]: df
Cat Cumulative1 Cumulative2 Indicator Discrete1
0 A 1 1 1 NaN
1 A 3 3 2 2.0
2 A 6 4 3 3.0
3 A 7 6 4 1.0
4 B 2 1 1 NaN
5 B 4 5 2 2.0
6 B 6 7 3 2.0
7 B 9 12 4 3.0
I have 3 questions:
How do I avoid the NaN in an elegant/Pythonic way? The correct values are to be found in the original Cumulative series.
Secondly, how do I elegantly apply this computation to all series, say -
cols = ['Cumulative1', 'Cumulative2']
Thirdly, I have a lot of data that needs this computation -- is this the most efficient way?
You do not want to avoid NaNs, you want to fill them with the start values from the "cumulative" column:
df['Discrete1'] = df['Discrete1'].combine_first(df['Cumulative1'])
To apply the operation to all (or select) columns, broadcast it to all columns of interest:
sources = 'Cumulative1', 'Cumulative2'
targets = ["Discrete" + x[len('Cumulative'):] for x in sources]
df[targets] = df.groupby('Cat')[sources].diff()
You still have to condition the NaNs in a loop:
for s,t in zip(sources, targets):
df[t] = df[t].combine_first(df[s])
I have a pandas DataFrame of the form:
df
ID_col time_in_hours data_col
1 62.5 4
1 40 3
1 20 3
2 30 1
2 20 5
3 50 6
What I want to be able to do is, find the rate of change of data_col by using the time_in_hours column. Specifically,
rate_of_change = (data_col[i+1] - data_col[i]) / abs(time_in_hours[ i +1] - time_in_hours[i])
Where i is a given row and the rate_of_change is calculated separately for different IDs
Effectively, I want a new DataFrame of the form:
new_df
ID_col time_in_hours data_col rate_of_change
1 62.5 4 NaN
1 40 3 -0.044
1 20 3 0
2 30 1 NaN
2 20 5 0.4
3 50 6 NaN
How do I go about this?
You can use groupby:
s = df.groupby('ID_col').apply(lambda dft: dft['data_col'].diff() / dft['time_in_hours'].diff().abs())
s.index = s.index.droplevel()
s
returns
0 NaN
1 -0.044444
2 0.000000
3 NaN
4 0.400000
5 NaN
dtype: float64
You can actually get around the groupby + apply given how your DataFrame is sorted. In this case, you can just check if the ID_col is the same as the shifted row.
So calculate the rate of change for everything, and then only assign the values back if they are within a group.
import numpy as np
mask = df.ID_col == df.ID_col.shift(1)
roc = (df.data_col - df.data_col.shift(1))/np.abs(df.time_in_hours - df.time_in_hours.shift(1))
df.loc[mask, 'rate_of_change'] = roc[mask]
Output:
ID_col time_in_hours data_col rate_of_change
0 1 62.5 4 NaN
1 1 40.0 3 -0.044444
2 1 20.0 3 0.000000
3 2 30.0 1 NaN
4 2 20.0 5 0.400000
5 3 50.0 6 NaN
You can use pandas.diff:
df.groupby('ID_col').apply(
lambda x: x['data_col'].diff() / x['time_in_hours'].diff().abs())
ID_col
1 0 NaN
1 -0.044444
2 0.000000
2 3 NaN
4 0.400000
3 5 NaN
dtype: float64
I'm trying to create a total column that sums the numbers from another column based on a third column. I can do this by using .groupby(), but that creates a truncated column, whereas I want a column that is the same length.
My code:
df = pd.DataFrame({'a':[1,2,2,3,3,3], 'b':[1,2,3,4,5,6]})
df['total'] = df.groupby(['a']).sum().reset_index()['b']
My result:
a b total
0 1 1 1.0
1 2 2 5.0
2 2 3 15.0
3 3 4 NaN
4 3 5 NaN
5 3 6 NaN
My desired result:
a b total
0 1 1 1.0
1 2 2 5.0
2 2 3 5.0
3 3 4 15.0
4 3 5 15.0
5 3 6 15.0
...where each 'a' column has the same total as the other.
Returning the sum from a groupby operation in pandas produces a column only as long as the number of unique items in the index. Use transform to produce a column of the same length ("like-indexed") as the original data frame without performing any merges.
df['total'] = df.groupby('a')['b'].transform(sum)
>>> df
a b total
0 1 1 1
1 2 2 5
2 2 3 5
3 3 4 15
4 3 5 15
5 3 6 15
I would like to apply a function that acts like fillna() but takes a different value than nan. Unfortunately DataFrame.replace() will not work in my case. Here is an example: Given a DataFrame:
df = pd.DataFrame([[1,2,3],[4,-1,-1],[5,6,-1]])
0 1 2
0 1 2.0 3.0
1 4 -1.0 -1.0
2 5 6.0 -1.0
3 7 8.0 NaN
I am looking for a function which will output:
0 1 2
0 1 2.0 3.0
1 4 2.0 3.0
2 5 6.0 3.0
3 7 8.0 NaN
So df.replace() with to_replace=-1 and 'method='ffill' will not work because it requires a column-independent value which will replace the -1 entries. In my example it is column-dependent. I know I can code it with a loop but am looking for an efficient code as it will be applied to a large DataFrame. Any suggestions? Thank you.
You can just replace the value with NaN and then call ffill:
In [3]:
df.replace(-1, np.NaN).ffill()
Out[3]:
0 1 2
0 1 2 3
1 4 2 3
2 5 6 3
I think you're over thinking this
EDIT
If you already have NaN values then create a boolean mask and update just those elements again with ffill on the inverse of the mask:
In [15]:
df[df == -1] = df[df != -1].ffill()
df
Out[15]:
0 1 2
0 1 2 3
1 4 2 3
2 5 6 3
3 7 8 NaN
Another method (thanks to #DSM in comments) is to use where to essentially do the same thing as above:
In [17]:
df.where(df != -1, df.replace(-1, np.nan).ffill())
Out[17]:
0 1 2
0 1 2 3
1 4 2 3
2 5 6 3
3 7 8 NaN
Let's say I have some DataFrame (with about 10000 rows in my case, this is just a minimal example)
>>> import pandas as pd
>>> sample_df = pd.DataFrame(
{'col1': list(range(1, 10)), 'col2': list(range(10, 19))})
>>> sample_df
col1 col2
0 1 10
1 2 11
2 3 12
3 4 13
4 5 14
5 6 15
6 7 16
7 8 17
8 9 18
For my purposes, I need to calculate the series represented by ln(col_i(n+1) / col_i(n)) for each col_i in my DataFrame, where n represents a row number.
How can I calculate this?
Background knowledge
I know that I can get the difference between each column in a very simple way using
>>> sample_df.diff()
col1 col2
0 NaN NaN
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 1
7 1 1
8 1 1
Or the percentage change, which is (col_i(n+1) - col_i(n))/col_i(n+1), using
>>> sample_df.pct_change()
col1 col2
0 NaN NaN
1 1.000000 0.100000
2 0.500000 0.090909
3 0.333333 0.083333
4 0.250000 0.076923
5 0.200000 0.071429
6 0.166667 0.066667
7 0.142857 0.062500
8 0.125000 0.058824
I have just been struggling with a straightforward way to get the direct division of each consecutive column by the previous. Were I to know how to do that even, I could just apply the natural logarithm to every element in the series after the fact.
Currently to solve my problem, I'm resorting to creating another column shifted with row elements down by 1 for each column and then applying the formula between the two columns. It seems messy and sub-optimal to me, though.
Any help would be greatly appreciated!
IIUC:
log of a ratio is the difference of logs:
sample_df.apply(np.log).diff()
Or better still:
np.log(sample_df).diff()
Timing
just use np.log:
np.log(df.col1 / df.col1.shift())
you can also use apply as suggested by #nikita but that will be slower.
in addition, if you wanted to do it for the entire dataframe, you could just do:
np.log(df / df.shift())
You can use shift for that, which does what you have proposed.
>>> sample_df['col1'].shift()
0 NaN
1 1.0
2 2.0
3 3.0
4 4.0
5 5.0
6 6.0
7 7.0
8 8.0
Name: col1, dtype: float64
The final answer would be:
import math
(sample_df['col1'] / sample_df['col1'].shift()).apply(lambda row: math.log(row))
0 NaN
1 0.693147
2 0.405465
3 0.287682
4 0.223144
5 0.182322
6 0.154151
7 0.133531
8 0.117783
Name: col1, dtype: float64