python dataframe: number of last consecutive rows less than current

I need to count, for each row, the number of immediately preceding consecutive rows whose value is less than the current row's value.
Below is a sample input and the result.
df = pd.DataFrame([10,9,8,11,10,11,13], columns=['value'])
df_result = pd.DataFrame({'value': [10, 9, 8, 11, 10, 11, 13],
                          'number of last consecutive rows less than current': [0, 0, 0, 3, 0, 1, 6]})
Is it possible to achieve this without a loop?
Otherwise, a solution with a loop would also be fine.
Follow-up question
Could I do it with a groupby operation for the following input?
df = pd.DataFrame([[10,0],[9,0],[7,0],[8,0],[11,1],[10,1],[11,1],[13,1]], columns=['value','group'])
The following raised an error:
df.groupby('group')['value'].expanding()

Assuming this input:
value
0 10
1 9
2 8
3 11
4 10
5 13
You can use cummax and a custom expanding function:
df['out'] = (df['value'].cummax().expanding()
             .apply(lambda s: s.lt(df.loc[s.index[-1], 'value']).sum()))
For the particular case of a < comparison, you can use a much faster trick with numpy. If a value is greater than all previous values, then it is greater than n values, where n is its position (0-based rank):
import numpy as np

m = df['value'].lt(df['value'].cummax())        # True where some previous value exceeds the current one
df['out'] = np.where(m, 0, np.arange(len(df)))
Output:
value out
0 10 0.0
1 9 0.0
2 8 0.0
3 11 3.0
4 10 0.0
5 13 5.0
update: consecutive values
df['out'] = (df['value'].expanding()
             .apply(lambda s: s.iloc[-2::-1].lt(s.iloc[-1]).cummin().sum()))
Output:
value out
0 10 0.0
1 9 0.0
2 8 0.0
3 11 3.0
4 10 0.0
5 11 1.0
6 13 6.0
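For the groupby follow-up in the question, the same consecutive-count logic can be applied per group. A minimal sketch (the helper name consecutive_less is mine, not from the original post), assuming each group should be counted independently:
import pandas as pd

def consecutive_less(s):
    # number of immediately preceding consecutive values less than the current one
    return s.expanding().apply(lambda w: w.iloc[-2::-1].lt(w.iloc[-1]).cummin().sum())

df = pd.DataFrame([[10,0],[9,0],[7,0],[8,0],[11,1],[10,1],[11,1],[13,1]],
                  columns=['value', 'group'])
df['out'] = df.groupby('group')['value'].transform(consecutive_less)
# expected out: 0, 0, 0, 1 for group 0 and 0, 0, 1, 3 for group 1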

Related

Unable to replace zeros in the last row of a dataframe with the mean of the last three rows in the respective columns, while leaving non-zero values as they are

I have a dataset which looks like this:
import pandas as pd, numpy as np
df = pd.DataFrame([[1,0,3,0], [5,6,7,8], [9,10,11,12], [13,14,15,16], [0,0,19,0]], columns=['a','b','c','d'])
So what I want to do is:
in the last row, wherever the value is 0, replace it with the mean of the previous three rows of the same column
if the value is not 0, then leave it as it is
All other 0s elsewhere should remain 0.
So the end result should look something like this:
a b c d
1 0 3 0
5 6 7 8
9 10 11 12
13 14 15 16
9 10 19 12
Here, all three 0s are replaced with the mean of the previous three values, and 19 remains as it is.
What I am trying to do is:
if (df.iloc[-1].any()==0):
    df.iloc[-1] = df[-4:-1].mean()
else:
    pass
This did not change the values, and no error was raised either. What am I doing wrong here?
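One likely reason the if version silently does nothing (a sketch of the diagnosis, based on the df above): df.iloc[-1].any() reduces the last row to a single boolean, which is True here because the row contains the non-zero 19, and True == 0 is False, so the assignment never runs. Even if it did run, df.iloc[-1] = df[-4:-1].mean() would overwrite the whole row rather than just the zeros; an element-wise mask is needed, for example:
mask = df.iloc[-1].eq(0)                                     # which cells of the last row are 0
df.iloc[-1] = df.iloc[-1].mask(mask, df.iloc[-4:-1].mean())  # replace only those cells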
It'll be much easier if you just replace 0 with NaN, then use fillna with a rolling mean and shift:
>>> df.iloc[-1]=df.iloc[-1].replace(0, np.nan)
>>> df=df.fillna(df.rolling(3, min_periods=1).mean().shift())
OUTPUT:
a b c d
0 1.0 0.0 3 0.0
1 5.0 6.0 7 8.0
2 9.0 10.0 11 12.0
3 13.0 14.0 15 16.0
4 9.0 10.0 19 12.0
With np.where:
last_row = df.iloc[-1]
df.iloc[-1] = np.where(last_row.eq(0), df.iloc[-4:-1].mean(), last_row)
This takes values from the mean of the previous three rows where the last row equals 0, and from the last row itself otherwise, i.e., non-zero values stay as they are.
pandas' where can be similarly used:
last_row = df.iloc[-1]
df.iloc[-1] = last_row.where(last_row.ne(0), df.iloc[-4:-1].mean())
Values in the last row that are not equal to 0 are kept; where the last row equals 0, they are replaced with the mean of the previous three rows.

pandas bfill by interval to correct missing/invalid entries

So I have a dataframe:
df = pandas.DataFrame([[numpy.nan,5],[numpy.nan,5],[2015,5],[2020,5],[numpy.nan,10],[numpy.nan,10],[numpy.nan,10],[2090,10],[2100,10]],columns=["value","interval"])
value interval
0 NaN 5
1 NaN 5
2 2015.0 5
3 2020.0 5
4 NaN 10
5 NaN 10
6 NaN 10
7 2090.0 10
8 2100.0 10
I need to backward-fill the NaN values based on their interval and the first non-NaN value following that index, so the expected output is:
value interval
0 2005.0 5 # corrected 2010 - 5(interval)
1 2010.0 5 # corrected 2015 - 5(interval)
2 2015.0 5 # no change ( use this to correct 2 previous rows)
3 2020.0 5 # no change
4 2060.0 10 # corrected 2070 - 10
5 2070.0 10 # corrected 2080 - 10
6 2080.0 10 # corrected 2090 - 10
7 2090.0 10 # no change (use this to correct 3 previous rows)
8 2100.0 10 # no change
I am at a loss as to how I can accomplish this task using pandas/numpy vectorized operations...
I can do it with a pretty simple loop:
last_good_value = None
fixed_values = []
for val, interval in reversed(df.values):
    if numpy.isnan(val) and last_good_value is not None:
        fixed_values.append(last_good_value - interval)
        last_good_value = fixed_values[-1]
    else:
        fixed_values.append(val)
        if not numpy.isnan(val):
            last_good_value = val
print(fixed_values[::-1])
which works, but I would like to understand a pandas solution that can resolve the values and avoid the loop (this is quite a big list in reality).
First, get the position of the rows within groups sharing the same 'interval' value.
Then, get the last value of each group.
What you are looking for is last_value - pos * interval:
df = df.reset_index()
grouped_df = df.groupby(['interval'])
df['pos'] = grouped_df['index'].rank(method='first', ascending=False) - 1
df['last'] = grouped_df['value'].transform('last')
df['value'] = df['last'] - df['interval'] * df['pos']
del df['pos'], df['last'], df['index']
Create a grouping Series that groups the last non-null value with all NaN rows before it, by reversing with [::-1]. Then you can bfill and use cumsum to determine how much to subtract off of every row.
s = df['value'].notnull()[::-1].cumsum()
subt = df.loc[df['value'].isnull(), 'interval'][::-1].groupby(s).cumsum()
df['value'] = df.groupby(s)['value'].bfill().subtract(subt, fill_value=0)
value interval
0 2005.0 5
1 2010.0 5
2 2015.0 5
3 2020.0 5
4 2060.0 10
5 2070.0 10
6 2080.0 10
7 2090.0 10
8 2100.0 10
Because subt is subset to only the NaN rows, fill_value=0 ensures that rows which already have values remain unchanged:
print(subt)
#6 10
#5 20
#4 30
#1 5
#0 10
#Name: interval, dtype: int64
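The same idea, the distance to the next valid value multiplied by the interval, can also be written with a reverse cumulative count. A minimal alternative sketch, assuming (as in the sample) that every run of NaNs is immediately followed by a valid value with the same interval:
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan,5],[np.nan,5],[2015,5],[2020,5],[np.nan,10],[np.nan,10],[np.nan,10],[2090,10],[2100,10]],
                  columns=["value","interval"])

valid = df['value'].notna()
grp = valid[::-1].cumsum()[::-1]                  # group each valid row with the NaN rows above it
dist = df.groupby(grp).cumcount(ascending=False)  # number of rows between each row and the next valid one
df['value'] = df['value'].bfill() - dist * df['interval']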

Nested lists to python dataframe

I have a nested numpy.ndarray of the following format (each of the sublists has the same size)
len(exp_data) # Timepoints
Out[205]: 42
len(exp_data[0])
Out[206]: 1
len(exp_data[0][0]) # Y_bins
Out[207]: 13
len(exp_data[0][0][0]) # X_bins
Out[208]: 43
type(exp_data[0][0][0][0])
Out[209]: numpy.float64
I want to move these into a pandas DataFrame with three columns of indices (each numbered from 0 to N) and a last column holding the float value.
I could do this with a series of loops, but that seems like a very non-efficient way of solving the problem.
In addition I would like to get rid of any nan values (not present in sample data). Do I do this after creating the df or is there a way to skip adding them in the first place?
NOTE: code below has been edited and I've added sample data
import random
import numpy as np
import pandas as pd

exp_data = [[[ [random.random() for x in range(5)],
               [random.random() for x in range(5)],
               [random.random() for x in range(5)],
             ]]] * 5
exp_data[0][0][0][1] = np.nan

df = pd.DataFrame(columns=['Timepoint', 'Y_bin', 'X_bin', 'Values'])
for t, timepoint in enumerate(exp_data):
    for y, y_bin in enumerate(timepoint[0]):
        for x, x_bin in enumerate(y_bin):
            df.loc[len(df)] = [int(t), int(y), int(x), x_bin]
df = df.dropna().reset_index(drop=True)
The final format should be as follows (except I'd preferably like integers instead of floats in the first three columns, but that's not essential; int(t) etc. doesn't do the trick):
df
Out[291]:
Timepoint Y_bin X_bin Values
0 0.0 0.0 0.0 0.095391
1 0.0 0.0 2.0 0.963608
2 0.0 0.0 3.0 0.855735
3 0.0 0.0 4.0 0.392637
4 0.0 1.0 0.0 0.555199
5 0.0 1.0 1.0 0.118981
6 0.0 1.0 2.0 0.201782
...
len(df) # has received a total of 75 (5*3*5) input values of which 5 are nan
Out[293]: 70
You can change the display format of the floats by adding
pd.options.display.float_format = '{:,.0f}'.format
at the end of your code, like this:
df = pd.DataFrame(columns=['Timepoint', 'Y_bin', 'X_bin', 'Values'])
for t, timepoint in enumerate(exp_data):
    for y, y_bin in enumerate(timepoint[0]):
        for x, x_bin in enumerate(y_bin):
            df.loc[len(df)] = [t, y, x, x_bin]
df = df.dropna().reset_index(drop=True)
pd.options.display.float_format = '{:,.0f}'.format
df
Out[250]:
Timepoint Y_bin X_bin Values
0 0 4 10 -2
1 0 4 11 -1
2 0 4 12 -2
3 0 4 13 -2
4 0 4 14 -2
5 0 4 15 -2
6 0 4 16 -3
...
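A loop-free construction is also possible by turning the nested list into one dense array first. A minimal sketch, assuming exp_data converts cleanly to a float array of shape (Timepoints, 1, Y_bins, X_bins):
import numpy as np
import pandas as pd

arr = np.asarray(exp_data, dtype=float).squeeze(axis=1)  # shape (Timepoints, Y_bins, X_bins)
t, y, x = np.indices(arr.shape)                          # integer index grids for each axis
df = pd.DataFrame({
    'Timepoint': t.ravel(),
    'Y_bin': y.ravel(),
    'X_bin': x.ravel(),
    'Values': arr.ravel(),
}).dropna().reset_index(drop=True)
Because the three index columns are built from integer arrays, they stay integers after dropna, which also addresses the int-vs-float wish in the question.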

Vectorization of loops in python

I have the following code in Python:
import numpy as np
import pandas as pd
colum1 = [1,2,3,4,5,6,7,8,9,10,11,12]
colum2 = [10,20,30,40,50,60,70,80,90,100,110,120]
df = pd.DataFrame({
    'colum1': colum1,
    'colum2': colum2,
})

df.loc[df.colum1 == 1, 'result'] = df['colum2']
for i in range(len(colum2)):
    df.result = np.where(df.colum1 > 1, 5 - (df['colum2'] - df.result.shift(1)), df.result)
the result of df.result is:
colum1 colum2 result
0 1 10 10.0
1 2 20 -5.0
2 3 30 -30.0
3 4 40 -65.0
4 5 50 -110.0
5 6 60 -165.0
6 7 70 -230.0
7 8 80 -305.0
8 9 90 -390.0
9 10 100 -485.0
10 11 110 -590.0
11 12 120 -705.0
I would like to know if there is a method that allows me to obtain the same result without using a for loop.
Your operation depends on two things: the previous row in the DataFrame, and the difference between consecutive values in the DataFrame. That hints that the solution will require shift and diff. However, you want to add a small constant to the expanding sum, and actually subtract this from each row, not add it.
To set the pieces of the problem up, first create your shifted series, where you add 5:
a = df.colum2.shift().add(5).cumsum().fillna(0)
Now you need the difference between elements in the Series, and fill missing results with their respective value in colum2:
b = df.colum2.diff().fillna(df.colum2)
To get your final result, simply subtract a from b:
b - a
0 10.0
1 -5.0
2 -30.0
3 -65.0
4 -110.0
5 -165.0
6 -230.0
7 -305.0
8 -390.0
9 -485.0
10 -590.0
11 -705.0
Name: colum2, dtype: float64
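As a cross-check, the recurrence result[i] = 5 - (colum2[i] - result[i-1]) with result[0] = colum2[0] can also be unrolled directly into a cumulative sum. A minimal sketch on the same data:
import numpy as np
import pandas as pd

colum2 = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120])
# unrolled: result[i] = colum2[0] + 5*i - (colum2[1] + ... + colum2[i])
result = colum2.iloc[0] + 5 * np.arange(len(colum2)) - (colum2.cumsum() - colum2.iloc[0])
print(result.tolist())  # [10, -5, -30, -65, -110, -165, -230, -305, -390, -485, -590, -705]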

splitting length (metre) data by interval with Pandas

I have a dataframe of length-interval data (from boreholes) which looks something like this:
df
Out[46]:
from to min intensity
0 0 10 py 2
1 5 15 cpy 3.5
2 14 27 spy 0.7
I need to pivot this data, but also break it on the least common length interval, resulting in the 'min' column values as the column headers and the values taken from 'intensity'. The output would look like this:
df.somefunc(index=['from','to'], columns='min', values='intensity', fill_value=0)
Out[47]:
from to py cpy spy
0 0 5 2 0 0
1 5 10 2 3.5 0
2 10 14 0 3.5 0
3 14 15 0 3.5 0.7
4 15 27 0 0 0.7
So basically the 'from' and 'to' columns describe non-overlapping intervals down a borehole, where the intervals have been split by the least common denominator; as you can see, the 'py' interval from the original table has been split, the first part (0-5 m) into py: 2, cpy: 0 and the second (5-10 m) into py: 2, cpy: 3.5.
The result from just a basic pivot_table function is this:
pd.pivot_table(df, values='intensity', index=['from', 'to'], columns="min", aggfunc="first", fill_value=0)
Out[48]:
min cpy py spy
from to
0 10 0 2 0
5 15 3.5 0 0
14 27 0 0 0.7
which just treats the combined from and to columns as an index. An important point is that my output cannot have overlapping from and to values (i.e. a subsequent 'from' value cannot be less than the previous 'to' value).
Is there an elegant way to accomplish this using Pandas? Thanks for the help!
I don't know of native interval arithmetic in pandas, so you need to do it yourself.
Here is a way to do that, if I understand the boundary conditions correctly.
Note that this builds a full (intervals × sub-intervals) table, so it can get huge for large inputs.
# make the new bounds
bounds = np.unique(np.hstack((df["from"], df["to"])))
df2 = pd.DataFrame({"from": bounds[:-1], "to": bounds[1:]})

# find inclusions: which new sub-intervals fall inside each original interval
isin = df.apply(lambda x:
                df2['from'].between(x['from'], x['to'] - 1)
                | df2['to'].between(x['from'] + 1, x['to']),
                axis=1).T

# data
data = np.where(isin, df.intensity, 0)

# result
df3 = pd.DataFrame(data,
                   pd.MultiIndex.from_arrays(df2.values.T),
                   df["min"])
This gives:
In [26]: df3
Out[26]:
min py cpy spy
0 5 2.0 0.0 0.0
5 10 2.0 3.5 0.0
10 14 0.0 3.5 0.0
14 15 0.0 3.5 0.7
15 27 0.0 0.0 0.7
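To match the requested layout with 'from' and 'to' as ordinary columns, the unnamed MultiIndex can be named and reset; a small follow-up sketch:
df3.rename_axis(index=['from', 'to']).reset_index()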
