So I have a DataFrame:
import numpy
import pandas

df = pandas.DataFrame(
    [[numpy.nan, 5], [numpy.nan, 5], [2015, 5], [2020, 5],
     [numpy.nan, 10], [numpy.nan, 10], [numpy.nan, 10], [2090, 10], [2100, 10]],
    columns=["value", "interval"],
)
    value  interval
0     NaN         5
1     NaN         5
2  2015.0         5
3  2020.0         5
4     NaN        10
5     NaN        10
6     NaN        10
7  2090.0        10
8  2100.0        10
I need to backward-fill the NaN values based on their interval and the first non-NaN value that follows them, so the expected output is:
    value  interval
0  2005.0         5   # corrected: 2010 - 5 (interval)
1  2010.0         5   # corrected: 2015 - 5 (interval)
2  2015.0         5   # no change (used to correct the 2 previous rows)
3  2020.0         5   # no change
4  2060.0        10   # corrected: 2070 - 10
5  2070.0        10   # corrected: 2080 - 10
6  2080.0        10   # corrected: 2090 - 10
7  2090.0        10   # no change (used to correct the 3 previous rows)
8  2100.0        10   # no change
I am at a loss as to how I can accomplish this task with pandas/numpy vectorized operations.
I can do it with a pretty simple loop:
last_good_value = None
fixed_values = []
for val, interval in reversed(df.values):
    # val == numpy.nan is always False, so NaN must be tested with numpy.isnan
    if numpy.isnan(val) and last_good_value is not None:
        fixed_values.append(last_good_value - interval)
        last_good_value = fixed_values[-1]
    else:
        fixed_values.append(val)
        if not numpy.isnan(val):
            last_good_value = val
print(fixed_values[::-1])
which strictly speaking works... but I would like to understand a pandas solution that resolves the values while avoiding the loop (this is quite a big list in reality).
First, get the position of each row within the group of rows sharing the same 'interval' value.
Then, get the last value of each group.
What you are looking for is last_value - pos * interval:
df = df.reset_index()
grouped_df = df.groupby(['interval'])

# position within the group, counted backwards (last row of each group = 0)
df['pos'] = grouped_df['index'].rank(method='first', ascending=False) - 1
# the last value of each group, broadcast to every row of that group
df['last'] = grouped_df['value'].transform('last')

df['value'] = df['last'] - df['interval'] * df['pos']
del df['pos'], df['last'], df['index']
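Note that this recomputes every row as last - pos * interval, the already-known values included. That is harmless here because within each group the known values are themselves spaced exactly one interval apart, but that assumption is worth checking on real data.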
Create a grouping Series that groups the last non-null value with all NaN rows before it, by reversing with [::-1]. Then you can bfill and use cumsum to determine how much to subtract off of every row.
s = df['value'].notnull()[::-1].cumsum()
subt = df.loc[df['value'].isnull(), 'interval'][::-1].groupby(s).cumsum()
df['value'] = df.groupby(s)['value'].bfill().subtract(subt, fill_value=0)
    value  interval
0  2005.0         5
1  2010.0         5
2  2015.0         5
3  2020.0         5
4  2060.0        10
5  2070.0        10
6  2080.0        10
7  2090.0        10
8  2100.0        10
Because subt is subset to only the NaN rows, fill_value=0 ensures that rows which already have values remain unchanged:
print(subt)
#6 10
#5 20
#4 30
#1 5
#0 10
#Name: interval, dtype: int64
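For reference, the grouping series s pairs each non-null value with the NaN rows that precede it (labels below derived by hand from the sample frame):
print(s)
#8 1
#7 2
#6 2
#5 2
#4 2
#3 3
#2 4
#1 4
#0 4
#Name: value, dtype: int64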
Related
I need to count, for each row, the number of immediately preceding consecutive rows whose value is less than the current row's value.
Below is a sample input and the expected result.
import pandas as pd

df = pd.DataFrame([10, 9, 8, 11, 10, 11, 13], columns=['value'])
df_result = pd.DataFrame({
    'value': [10, 9, 8, 11, 10, 11, 13],
    'consecutive_less': [0, 0, 0, 3, 0, 1, 6],
})
Is it possible to achieve this without loop?
Otherwise solution with loop would be good.
A further question
Could I do it with a groupby operation, for the following input?
df = pd.DataFrame([[10,0],[9,0],[7,0],[8,0],[11,1],[10,1],[11,1],[13,1]], columns=['value','group'])
The following raised an error:
df.groupby('group')['value'].expanding()
Assuming this input:
value
0 10
1 9
2 8
3 11
4 10
5 13
You can use a cummax and expanding custom function:
df['out'] = (df['value'].cummax().expanding()
               .apply(lambda s: s.lt(df.loc[s.index[-1], 'value']).sum())
            )
For the particular case of the < comparison, you can use a much faster trick with numpy. If a value is greater than all previous values, then it is greater than n values, where n is its positional index (the number of previous values):
m = df['value'].lt(df['value'].cummax())
df['out'] = np.where(m, 0, np.arange(len(df)))
Output:
value out
0 10 0.0
1 9 0.0
2 8 0.0
3 11 3.0
4 10 0.0
5 13 5.0
Update: counting only consecutive smaller values, using the full 7-row input from the question:
df['out'] = (
    df['value'].expanding()
      .apply(lambda s: s.iloc[-2::-1].lt(s.iloc[-1]).cummin().sum())
)
Output:
value out
0 10 0.0
1 9 0.0
2 8 0.0
3 11 3.0
4 10 0.0
5 11 1.0
6 13 6.0
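For the follow-up groupby question: df.groupby('group')['value'].expanding() on its own gives back an object that still needs an aggregation applied to it, which is likely why it errored. A minimal sketch of one workaround, assuming the same consecutive-count logic should restart within each group, is to route the expanding apply through transform:

import pandas as pd

df = pd.DataFrame([[10,0],[9,0],[7,0],[8,0],[11,1],[10,1],[11,1],[13,1]],
                  columns=['value', 'group'])

# count the consecutive preceding smaller values, restarting at each group boundary
df['out'] = df.groupby('group')['value'].transform(
    lambda g: g.expanding().apply(
        lambda s: s.iloc[-2::-1].lt(s.iloc[-1]).cummin().sum()
    )
)
print(df)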
I cannot figure out how to compare two columns and, where one column is greater than or equal to the other, write 1 to a new column. If the condition is not met, I would like Python to do nothing.
The data set for testing is here:
import pandas as pd

data = [[12, 10], [15, 10], [8, 5], [4, 5], [15, 'NA'], [5, 'NA'], [10, 10], [9, 10]]
df = pd.DataFrame(data, columns=['Score', 'Benchmark'])
Score Benchmark
0 12 10
1 15 10
2 8 5
3 4 5
4 15 NA
5 5 NA
6 10 10
7 9 10
The desired output is:
desired_output_data = [[12,10, 1],[15,10,1],[8,5,1],[4,5],[15,'NA'],[5,'NA'],[10,10,1], [9,10]]
desired_output_df = pd.DataFrame(desired_output_data, columns = ['Score', 'Benchmark', 'MetBench'])
Score Benchmark MetBench
0 12 10 1.0
1 15 10 1.0
2 8 5 1.0
3 4 5 NaN
4 15 NA NaN
5 5 NA NaN
6 10 10 1.0
7 9 10 NaN
I tried doing something like this:
if df['Score'] >= df['Benchmark']:
    df['MetBench'] = 1
I am new to programming in general so any guidance would be greatly appreciated.
Thank you!
You can use ge and map:
df.Score.ge(df.Benchmark).map({True: 1, False:np.nan})
Or use the mapping from False to np.nan implicitly, since pandas uses the dict.get method to apply the mapping, and None is the default value (thanks @piRSquared):
df.Score.ge(df.Benchmark).map({True: 1})
Or simply Series.where (note this keeps True rather than 1.0 where the condition holds):
df.Score.ge(df.Benchmark).where(lambda s: s)
The map versions output
0 1.0
1 1.0
2 1.0
3 NaN
4 NaN
5 NaN
6 1.0
7 NaN
dtype: float64
Make sure to do
df['Benchmark'] = pd.to_numeric(df['Benchmark'], errors='coerce')
first, since you have 'NA' as a string, but you need the numeric value np.nan to be able to compare it with other numbers
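Putting it together, a minimal end-to-end sketch (names as in the question):

import pandas as pd

data = [[12, 10], [15, 10], [8, 5], [4, 5], [15, 'NA'], [5, 'NA'], [10, 10], [9, 10]]
df = pd.DataFrame(data, columns=['Score', 'Benchmark'])

# coerce the 'NA' strings to real NaN so the comparison is numeric
df['Benchmark'] = pd.to_numeric(df['Benchmark'], errors='coerce')
# True -> 1; False and NaN comparisons fall through to NaN
df['MetBench'] = df.Score.ge(df.Benchmark).map({True: 1})
print(df)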
I have the following code in Python:
import numpy as np
import pandas as pd
colum1 = [1,2,3,4,5,6,7,8,9,10,11,12]
colum2 = [10,20,30,40,50,60,70,80,90,100,110,120]
df = pd.DataFrame({
    'colum1': colum1,
    'colum2': colum2,
})

df.loc[df.colum1 == 1, 'result'] = df['colum2']
for i in range(len(colum2)):
    df.result = np.where(df.colum1 > 1,
                         5 - (df['colum2'] - df.result.shift(1)),
                         df.result)
The resulting DataFrame is:
colum1 colum2 result
0 1 10 10.0
1 2 20 -5.0
2 3 30 -30.0
3 4 40 -65.0
4 5 50 -110.0
5 6 60 -165.0
6 7 70 -230.0
7 8 80 -305.0
8 9 90 -390.0
9 10 100 -485.0
10 11 110 -590.0
11 12 120 -705.0
I would like to know if there is a method that allows me to obtain the same result without using a for loop.
Your operation depends on two things: the previous row in the DataFrame, and the difference between consecutive values. That hints that the solution will require shift and diff. However, you also want to feed a small constant (5) into the running sum, and to subtract the accumulated total from each row rather than add it.
To set the pieces of the problem up, first create your shifted series, where you add 5:
a = df.colum2.shift().add(5).cumsum().fillna(0)
Now you need the difference between elements in the Series, and fill missing results with their respective value in colum2:
b = df.colum2.diff().fillna(df.colum2)
To get your final result, simply subtract a from b:
b - a
0 10.0
1 -5.0
2 -30.0
3 -65.0
4 -110.0
5 -165.0
6 -230.0
7 -305.0
8 -390.0
9 -485.0
10 -590.0
11 -705.0
Name: colum2, dtype: float64
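An equivalent way to see the recurrence: after the first row, each step just adds 5 - colum2[i] to the previous result, so a single cumulative sum reproduces it. A minimal sketch against the same frame ('result2' is a hypothetical comparison column):

# row 0 seeds the sum with colum2[0]; every later row contributes 5 - colum2[i]
step = 5 - df['colum2']
step.iloc[0] = df['colum2'].iloc[0]
df['result2'] = step.cumsum()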
I'd like to create a new column containing values calculated from shifted values in another column.
As you can see in the code below, I first create some time-series data.
'price' is a randomly generated time series, and 'momentum' is the average momentum over the most recent 12 periods.
I'd like to add a new column containing the average momentum over 'n' periods, where 'n' is the per-row value in df['shift'] rather than the 12 fixed inside the momentum function.
How can I do this?
(In the example below, momentum was calculated with a fixed 12.)
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.uniform(low=0.8, high=1.3, size=100).cumprod(),columns = ['price'])
df['shift'] = np.random.randint(5, size=100)+3
df
def momentum(x):
    init = 0
    for i in range(1, 13):
        init = x.price / x.price.shift(i) + init
    return init / 12

df['momentum'] = momentum(df)
price shift momentum
0 1.069857 3 NaN
1 0.986563 7 NaN
2 0.809052 5 NaN
3 0.991204 3 NaN
4 0.846159 6 NaN
5 0.717344 4 NaN
6 0.599436 3 NaN
7 0.596711 7 NaN
8 0.543450 4 NaN
9 0.511640 3 NaN
10 0.496865 3 NaN
11 0.460142 4 NaN
12 0.435862 4 0.657192
13 0.410519 4 0.665493
14 0.368428 5 0.640927
15 0.335583 7 0.625128
16 0.313470 7 0.635423
17 0.321265 4 0.704990
18 0.319503 7 0.746885
19 0.365991 4 0.900135
20 0.300793 4 0.766266
21 0.274449 6 0.733104
This is my approach
def momentum(shift, price, array, index):
    if shift > index:
        return 0
    init = 0
    for i in range(1, int(shift) + 1):
        init += price / array[int(index) - i]
    return init / shift  # divide by n so it is an average, matching the fixed-12 version

df = pd.DataFrame(np.random.uniform(low=0.8, high=1.3, size=100).cumprod(), columns=['price'])
df['shift'] = np.random.randint(5, size=100) + 3
df['Index'] = df.index
series = df['price'].tolist()
df['momentum'] = df.apply(lambda row: momentum(row['shift'], row['price'], series, row['Index']), axis=1)
print(df)
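A middle ground that avoids the per-row apply: loop only over the distinct shift values (at most 5 here), compute the fixed-window momentum for each with vectorized column operations, and keep the rows whose 'shift' matches. A sketch, assuming (as in the fixed-12 version) the result should be an average over the n periods:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.uniform(low=0.8, high=1.3, size=100).cumprod(),
                  columns=['price'])
df['shift'] = np.random.randint(5, size=100) + 3

out = pd.Series(np.nan, index=df.index)
for n in sorted(df['shift'].unique()):
    # average momentum over a fixed window n, vectorized over the whole column
    m = sum(df['price'] / df['price'].shift(k) for k in range(1, n + 1)) / n
    mask = df['shift'] == n
    out[mask] = m[mask]
df['momentum'] = out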
I have two pandas series objects with slightly different indexes. I want to divide one series by another. The default method gives me NAs when one of the two series is missing an indexed element. There is an option to fill missing values, but it can only be set to one value. I want to fill a value based on which series is missing the value.
For example
series1
0 10
1 20
2 30
3 40
series2
1 2
2 3
3 4
4 5
expected result: series1.divide(series2)
0 inf
1 10
2 10
3 10
4 0
actual result: series1.divide(series2)
0 NaN
1 10
2 10
3 10
4 NaN
Is there an easy way to do this?
You could use reindex to expand both indexes to their union, filling missing values with 0. The division then yields inf where series2 lacked a label (10/0) and 0.0 where series1 did (0/5):
import pandas as pd

series1 = pd.Series([10, 20, 30, 40], index=[0, 1, 2, 3])
series2 = pd.Series([2, 3, 4, 5], index=[1, 2, 3, 4])

# align both series on the union of the two indexes, filling gaps with 0
idx = series1.index.union(series2.index)
series1 = series1.reindex(idx, fill_value=0)
series2 = series2.reindex(idx, fill_value=0)
print(series1.div(series2))
# 0     inf
# 1    10.0
# 2    10.0
# 3    10.0
# 4     0.0
# dtype: float64
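A possible shortcut, if filling with 0 on both sides is acceptable: div's fill_value argument substitutes 0 for whichever series is missing a label before dividing, which gives the same inf/0 result here:

s1 = pd.Series([10, 20, 30, 40], index=[0, 1, 2, 3])
s2 = pd.Series([2, 3, 4, 5], index=[1, 2, 3, 4])
print(s1.div(s2, fill_value=0))  # inf at index 0 (10/0), 0.0 at index 4 (0/5)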