Pandas - Find longest stretch without NaN values - python

I have a pandas dataframe "df", a sample of which is below:
time x
0 1 1
1 2 NaN
2 3 3
3 4 NaN
4 5 8
5 6 7
6 7 5
7 8 NaN
The real frame is much bigger. I am trying to find the longest stretch of non-NaN values in the "x" series, and print out the starting and ending index for this stretch. Is this possible?
Thank You

Here's a vectorized approach with NumPy tools -
import numpy as np

a = df.x.values                                      # Extract the relevant column from the dataframe as an array
m = np.concatenate(( [True], np.isnan(a), [True] ))  # NaN mask, padded with True at both ends
ss = np.flatnonzero(m[1:] != m[:-1]).reshape(-1,2)   # Start-stop limits of the non-NaN stretches (stop exclusive)
start, stop = ss[(ss[:,1] - ss[:,0]).argmax()]       # Limits of the longest stretch
Sample run -
In [474]: a
Out[474]:
array([ 1., nan, 3., nan, nan, nan, nan, 8., 7., 5., 2.,
5., nan, nan])
In [475]: start, stop
Out[475]: (7, 12)
The intervals are set up such that the difference between each start and stop gives the length of each interval, i.e. stop is exclusive. So, if by ending index you mean the last index of a non-NaN element, we need to subtract one from stop.
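For instance, to report the stretch as an inclusive [start, last] pair for the sample run above:
last = stop - 1      # last index of the non-NaN stretch (inclusive)
print(start, last)   # 7 11 for the sample array above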

pandas
import pandas as pd

f = dict(
    Start=pd.Series.first_valid_index,
    Stop=pd.Series.last_valid_index,
    Stretch='count'
)
agged = df.x.groupby(df.x.isnull().cumsum()).agg(f)
agged.loc[agged.Stretch.idxmax(), ['Start', 'Stop']].values
array([ 4., 6.])
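Note that newer pandas versions may reject a plain dict of functions in SeriesGroupBy.agg; in that case, the keyword-style named aggregation below is a sketch of the equivalent call:
agged = df.x.groupby(df.x.isnull().cumsum()).agg(
    Start=pd.Series.first_valid_index,
    Stop=pd.Series.last_valid_index,
    Stretch='count',
)
agged.loc[agged.Stretch.idxmax(), ['Start', 'Stop']].values   # array([4., 6.]) as above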
numpy
import numpy as np

def pir(x):
    # pad with np.nan at both ends
    x = np.append(np.nan, np.append(x, np.nan))
    # positions of the nulls in the padded array
    w = np.where(np.isnan(x))[0]
    # diff gives the length of each stretch between nulls;
    # argmax finds the largest stretch
    a = np.diff(w).argmax()
    # map the bounding null positions back to the original
    # start and stop (inclusive) indices of that stretch
    return w[[a, a + 1]] + np.array([0, -2])
demo
pir(df.x.values)
array([4, 6])
a = np.array([1, np.nan, 3, np.nan, np.nan, np.nan, np.nan, 8, 7, 5, 2, 5, np.nan, np.nan])
pir(a)
array([ 7, 11])

So you can get the index values of the NaN's in the following way:
import numpy as np
index = df['x'].index[df['x'].apply(np.isnan)]
df_index = df.index.values.tolist()
[df_index.index(indexValue) for indexValue in index]
>>> [1, 3, 7]
Then one solution would be to take the largest difference between subsequent index values, which would give you the longest stretch of non-NaN values.

Maybe a faster way would be the following (given that you say you have a long dataframe, speed matters):
In [19]: df = pd.DataFrame({'time':[1,2,3,4,5,6,7,8],'x':[1,np.NAN,3,np.NAN,8,7,5,np.NAN]})
In [20]: index = df['x'].isnull()
In [21]: df[index].index.values
Out[21]: array([1, 3, 7])
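Either way, once you have the NaN positions, a small sketch of the "largest gap between subsequent NaN positions" idea (assuming the default RangeIndex, with sentinel positions added before the first and after the last row) could look like this:
import numpy as np
nan_pos = df[index].index.values                     # array([1, 3, 7]) from above
bounds = np.concatenate(([-1], nan_pos, [len(df)]))  # sentinels outside the frame
gap = np.diff(bounds).argmax()                       # largest gap between NaNs
start, stop = bounds[gap] + 1, bounds[gap + 1] - 1   # inclusive indices: (4, 6) here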

Another method is to use scipy.ndimage.label. It segments the valid (non-null) rows into contiguous groups and labels each group differently. You can then group your dataframe using those labels and take the biggest group.
Set-up
import pandas as pd
import numpy as np
from scipy.ndimage import label
df = pd.DataFrame({'time':[1,2,3,4,5,6,7,8],'x':[1,np.NAN,3,np.NAN,8,7,5,np.NAN]})
Retrieving longest stretch without nan
valid_rows = ~df.isnull().any(axis=1)
labels, num_features = label(valid_rows)
label_of_biggest_group = valid_rows.groupby(labels).count().drop(0).idxmax()
print(df.loc[labels == label_of_biggest_group])
Result
time x
4 5 8.0
5 6 7.0
6 7 5.0
Note
The label 0 corresponds to the background, in our case the NaN rows, and it has to be dropped in case the number of NaN rows is greater than or equal to the size of your biggest group. num_features is the number of contiguous stretches without NaN.
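If you only need the starting and ending index of that stretch (which is what the question asks for), a small follow-up on the variables above:
stretch_index = df.index[labels == label_of_biggest_group]
start, stop = stretch_index[0], stretch_index[-1]   # 4 and 6 for the sample frame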

Related

numpy vectorized operation for a large array

I am trying to do some computations on a NumPy array in Python 3.
The array:
c0 c1 c2 c3
r0 1 5 2 7
r1 3 9 4 6
r2 8 2 1 3
Here the "cx" and "rx" are column and row names.
For each row, I need to subtract the value in a given column from every other element in that row (the element in the given column itself stays unchanged).
e.g.
given a column list [0, 2, 1] # they are column indices
which means that
for r0, we need to calculate the difference between the c0 and all other columns, so we have
[1, 5-1, 2-1, 7-1]
for r1, we need to calculate the difference between the c2 and all other columns, so we have
[3-4, 9-4, 4, 6-4]
for r2, we need to calculate the difference between the c1 and all other columns, so we have
[8-2, 2, 1-2, 3-2]
so, the result should be
1 4 1 6
-1 5 4 2
6 2 -1 1
Because the array could be very large, I would like to do the calculation by numpy vectorized operation, e.g. broadcasting.
But I am not sure how to do it efficiently.
I have checked Vectorizing operation on numpy array, Vectorizing a Numpy slice operation, Vectorize large NumPy multiplication, Replace For Loop with Numpy Vectorized Operation, Vectorize numpy array for loop.
But, none of them work for me.
Thanks for any help!
Extract the values from the array first and then do subtraction:
import numpy as np
a = np.array([[1, 5, 2, 7],
[3, 9, 4, 6],
[8, 2, 1, 3]])
cols = [0,2,1]
# create the index for advanced indexing
idx = np.arange(len(a)), cols
# extract values
vals = a[idx]
# subtract array by the values
a -= vals[:, None]
# add original values back to corresponding position
a[idx] += vals
print(a)
#[[ 1 4 1 6]
# [-1 5 4 2]
# [ 6 2 -1 1]]
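As an aside, if you'd prefer not to modify a in place, the same idea can be written out of place (a small sketch replacing the two in-place lines above):
result = a - vals[:, None]   # subtract each row's chosen value from the whole row
result[idx] = vals           # keep the chosen elements at their original values
print(result)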

How do I use np.nanmin when comparing one column of a pandas dataframe with an integer?

import pandas as pd
import numpy as np
a = np.array([[1, 2], [3, np.nan]])
np.nanmin(a, axis=0)
array([1., 2.])
I want to use the same logic but on pandas dataframe columns, comparing each value of a column with an integer.
use case:
MC_cond = df['MODEL'].isin(["MC"])
df_lgd_type = df['LGD_TYPE'].isin(["FIXED"])
df_without_lgd_type = ~(df_lgd_type)
x = np.nanmin((1, df.loc[MC_cond & df_without_lgd_type, 'A'] +
                  df.loc[MC_cond & df_without_lgd_type, 'B']))
Comparing the sum of column A and column B with 1.
This should do the trick even without np.nanmin. I hope I've understood everything correctly from your sparse description.
I'm assuming you also want to replace those NaN values that are left after summation, so we fill those with 1 and then cap all values at 1.
a = df.loc[MC_cond & df_without_lgd_type, 'A']
b = df.loc[MC_cond & df_without_lgd_type, 'B']
x = (a + b).fillna(1).clip(upper=1)
Example:
df = pd.DataFrame({
    'A': [-1, np.nan, 2, 3, 4],
    'B': [-4, 5, np.nan, 7, -8]
})
(df.A + df.B).fillna(1).clip(upper=1)
# Output:
# 0 -5.0
# 1 1.0
# 2 1.0
# 3 1.0
# 4 -4.0
# dtype: float64
In case you don't want NaN values in one column leading to row sum being NaN too, just fill them before:
x = (a.fillna(0) + b.fillna(0)).fillna(1).clip(upper=1)
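On the example df above, that variant gives the same capped result (the rows with a NaN end up as 1 either way here, since their pre-filled sums 5 and 2 also get capped at 1):
(df.A.fillna(0) + df.B.fillna(0)).fillna(1).clip(upper=1)
# 0   -5.0
# 1    1.0
# 2    1.0
# 3    1.0
# 4   -4.0
# dtype: float64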
Just for completeness, this would be a pure numpy solution resembling your approach:
a = df.loc[MC_cond & df_without_lgd_type, 'A'].to_numpy()
b = df.loc[MC_cond & df_without_lgd_type, 'B'].to_numpy()
# optionally fill NaNs with 0
# a = np.nan_to_num(a)
# b = np.nan_to_num(b)
s = a + b
x = np.nanmin(np.stack([s, np.ones_like(s)]), axis=0)
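As a quick sanity check of this numpy variant against the example df above (taking a and b as the full columns rather than the masked selection):
a = df['A'].to_numpy()
b = df['B'].to_numpy()
s = a + b
np.nanmin(np.stack([s, np.ones_like(s)]), axis=0)
# array([-5.,  1.,  1.,  1., -4.])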

In Python DataFrame how to find out number of rows that have valid values of columns

I want to find the number of rows that have certain values such as None or "" or NaN (basically empty values) in all columns of a DataFrame object. How can I do this?
Use pandas dataframe.isin to create a boolean array. Sum by row, then find
the number of rows with a result > 0.
Place one or more values in the search_values list to look for within the rows of the dataframe.
search_values = ['', np.nan, None]
(df.isin(search_values).sum(axis=1) > 0).sum()
If you would like the row count per column:
df.isin(search_values).sum(axis=0)
Use df.isnull().sum() to count the None and NaN values in each column.
Use df.eq(value).sum() for any other kind of value, including the empty string "".
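A small combined sketch (hypothetical data) counting rows that contain NaN/None or an empty string in any column:
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, None, 3, ''], 'b': ['x', '', np.nan, 'y']})
empty_like = df.isnull() | df.eq('')   # True where NaN/None or ""
print(empty_like.any(axis=1).sum())    # 3 rows contain at least one such value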
In a pandas.Series (think of it as the column of a normal pandas.DataFrame):
>> s = pd.Series([np.nan, np.nan, 1, 2, np.nan])
>> s
0 NaN
1 NaN
2 1.0
3 2.0
4 NaN
>> s.isnull().sum()
3
For a pandas.DataFrame it is quite similar:
>> df = pd.DataFrame(np.array([[np.nan, np.nan],
...:                           [ 0., np.nan],
...:                           [ 1., 1.],
...:                           [ 2., 2.],
...:                           [np.nan, np.nan]]))
>> df
0 1
0 NaN NaN
1 0.0 NaN
2 1.0 1.0
3 2.0 2.0
4 NaN NaN
>> df.isnull().sum(axis=0)
0 2
1 3
dtype: int64
To sum by row, just put .sum(axis=1).
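For the DataFrame above that gives:
>> df.isnull().sum(axis=1)
0    2
1    1
2    0
3    0
4    2
dtype: int64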

A GroupBy with combinations of the categorical variables

Let's say I have data:
pd.DataFrame({'index': ['a','b','c','a','b','c'], 'column': [1,2,3,4,1,2]}).set_index(['index'])
which gives:
column
index
a 1
b 2
c 3
a 4
b 1
c 2
Then to get the mean of each subgroup one would:
df.groupby(df.index).mean()
column
index
a 2.5
b 1.5
c 2.5
However, what I've been trying to achieve, without constantly looping and slicing the data, is getting the mean for pairs of subgroups.
For instance, the mean of a & b would be 2, as if their values were combined.
The output would be something akin to:
column
index
a & a 2.5
a & b 2.0
a & c 2.5
b & b 1.5
b & c 2.0
c & c 2.5
Preferably this would involve manipulating the parameters in groupby, but as it is, I'm having to resort to looping and slicing, building up all combinations of subgroups at some point.
I've revisited this 3 years later with a general solution to this problem.
It's being used in this open source library, which is why I'm now able to post it here. It works with any number of indexes and creates combinations on them using numpy broadcasting.
So first of all, that is not a usable dataframe for this: the index values aren't unique. Let's add another, unique index to that object and make it a Series:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'unique': [1, 2, 3, 4, 5, 6],
    'index': ['a','b','c','a','b','c'],
    'column': [1,2,3,4,1,2]
}).set_index(['unique','index'])
s = df['column']
Let's unstack that index:
>>> idxs = ['index'] # set as variable to be used later on
>>> unstacked = s.unstack(idxs)
column
index a b c
unique
1 1.0 NaN NaN
2 NaN 2.0 NaN
3 NaN NaN 3.0
4 4.0 NaN NaN
5 NaN 1.0 NaN
6 NaN NaN 2.0
>>> vals = unstacked.values
array([[ 1., nan, nan],
[ nan, 2., nan],
[ nan, nan, 3.],
[ 4., nan, nan],
[ nan, 1., nan],
[ nan, nan, 2.]])
>>> sum = np.nansum(vals, axis=0)
>>> count = (~np.isnan(vals)).sum(axis=0)
>>> mean = (sum + sum[:, np.newaxis]) / (count + count[:, np.newaxis])
array([[ 2.5, 2. , 2.5],
[ 2. , 1.5, 2. ],
[ 2.5, 2. , 2.5]])
Now recreate the output dataframe:
>>> new_df = pd.DataFrame(mean, unstacked.columns, unstacked.columns.copy())
index_ a b c
index
a 2.5 2.0 2.5
b 2.0 1.5 2.0
c 2.5 2.0 2.5
>>> idxs_ = [ x+'_' for x in idxs ]
>>> new_df.columns.names = idxs_
>>> new_df.stack(idxs_, dropna=False)
index index_
a a 2.5
b 2.0
c 2.5
b a 2.0
b 1.5
c 2.0
c a 2.5
b 2.0
c 2.5
My current implementation is:
import pandas as pd
import itertools
import numpy as np
# get all pairs of categories here
def all_pairs(df, ix):
    hash = {
        ix: [],
        'p': []
    }
    for subset in itertools.combinations(np.unique(np.array(df.index)), 2):
        hash[ix].append(subset)
        hash['p'].append(df.loc[pd.IndexSlice[subset], :].mean())
    return pd.DataFrame(hash).set_index(ix)
This gets the combinations and then adds them to the hash, which is then built back up into a dataframe. It's hacky though :(
Here's an implementation that uses a MultiIndex and an outer join to handle the cross join.
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
df = pd.DataFrame({'index': ['a','b','c','a','b','c'], 'column': [1,2,3,4,1,2]}).set_index(['index'])
groupedDF = df.groupby(df.index).mean()
# Create new MultiIndex using from_product which gives a paring of the elements in each iterable
p = pd.MultiIndex.from_product([groupedDF.index, groupedDF.index])
# Add column for cross join
groupedDF[0] = 0
# Outer Join
groupedDF = pd.merge(groupedDF, groupedDF, how='outer', on=0).set_index(p)
# get mean for every row (which is the average for each pair)
# unstack to get matrix for deduplication
crossJoinMeans = groupedDF[['column_x', 'column_y']].mean(axis=1).unstack()
# Create Identity matrix because each pair of itself will be needed
b = np.identity(3, dtype='bool')
# set the first column to True because it contains the rest of the unique means (the identity portion covers the first part)
b[:,0] = True
# invert for proper use of DataFrame Mask
b = np.invert(b)
finalDF = crossJoinMeans.mask(b).stack()
I'd guess that this could be cleaned up and made more concise.
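For what it's worth, a more concise sketch of the same pairwise result (not this answer's code; it just combines grouped sums and counts by broadcasting, as in the answer above):
import pandas as pd
import numpy as np

df = pd.DataFrame({'index': ['a','b','c','a','b','c'],
                   'column': [1,2,3,4,1,2]}).set_index(['index'])

g = df.groupby(df.index)['column']
sums, counts = g.sum(), g.count()
s, n = sums.to_numpy(), counts.to_numpy()

pair_means = pd.DataFrame((s + s[:, None]) / (n + n[:, None]),
                          index=sums.index, columns=sums.index)
# pair_means.loc['a', 'b'] -> 2.0, pair_means.loc['a', 'a'] -> 2.5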

Filter numpy array by comparing elements to elements in prior row without looping

I am very new to Python and NumPy and have spent a couple of days searching for an answer to this question.
Consider the following 2D array of stock prices with columns 0 through 3 being the open, high, low and close prices with each row (0-6) being subsequent days.
O H L C
0 | 43.97 43.97 43.75 43.94
1 | 43.97 44.25 43.97 44.25
2 | 44.22 44.38 44.12 44.34
3 | 44.41 44.84 44.38 44.81
4 | 44.97 45.09 44.47 45.00
5 | 44.97 45.06 44.72 44.97
6 | 44.97 45.12 44.91 44.97
For this example I will use O, H, L, or C to represent columns 0-3, and 0, 1 or 2 to represent a row offset (backwards) for O, H, L or C.
H2 would mean the value of column H two rows back, and C0 would mean the value of column C in the current row. So in row 3, H2 would equal 44.25 and C0 would equal 44.81.
I would like to get the rows from this type of array using conditions that effectively equate to the logical statement C0 > H2 or similar statement. Ultimately I want to include multiple comparisons like this to return a subset of the array rows.
Is it possible to do this without looping through the array?
Generally speaking, you're wanting to do things like (to use your example of "C0 > H2"):
values = data[2:][C[2:] > H[:-2]]
However, you can easily see how this becomes repetitive.
Therefore, it's easiest to make new sequences of "H2", etc that are the same length as the rest of your data. When you do this, you need some way to indicate which values are invalid or insert valid values.
There's more than one way to handle this (e.g. different boundary conditions, masked arrays, etc). For example, you could decide to extend the series with the last valid value, instead.
For the moment, because you have floating point arrays, let's insert NaN's into the missing positions. That way any comparisons will return False.
In that case, you'd do something like:
H2 = np.pad(H[:-2], (2, 0), mode='constant', constant_values=(np.nan,))
or more generally:
def shift(data, amount):
    data = data[:-amount]
    pad = (amount, 0)
    return np.pad(data, pad, mode='constant', constant_values=(np.nan,))
That way you can directly compare things. E.g. H[H > shift(H, 2)]
Also, as DSM mentioned, look into using pandas for this. It will make your life much easier in general, and the equivalent expression would be:
df[df.C > df.H.shift(2)]
Detailed Explanation
Let's break that down a bit.
If we start with the series x = [0, 1, 2, 3, 4, 5], then x[:-2] will give us [0, 1, 2, 3]
import numpy as np
x = np.arange(6)
x2 = x[:-2]
However, if we wanted to compare it with some other sequence of the same original length, we have a problem, as x2 is now two items shorter than the other sequence.
y = np.linspace(-2, -3, 6)
and comparing them will raise a ValueError, as they're not the same length:
In [4]: x2 > y
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-22-eec160476995> in <module>()
----> 1 x2 > y
ValueError: operands could not be broadcast together with shapes (4) (6)
Furthermore, we don't want to compare the first value of the new "shifted" x with the first value of the original sequence. We want to compare the first item of the "shifted" sequence with the third item of the original sequence.
To do this, we need to slice the other sequence as well. E.g. y[2:]:
In [5]: x2 > y[2:]
Out[5]: array([ True, True, True, True], dtype=bool)
However, this is a bit clunky. We need to know how much x2 has been shifted by to use it properly. It's much easier to insert new values into x2 so that we can index directly with it.
In my original example, I used np.pad to insert NaNs at the start of the array.
x2 = np.pad(x[:-2], (2, 0), mode='constant', constant_values=(np.nan,))
The necessary arguments to pad are a touch awkward in this case. If you'd prefer not to use np.pad, you could also do something similar to the following:
x2 = np.hstack([2 * [np.nan], x[:-2]])
The big advantage of either of these approaches is that we have arrays of the same length, and any comparisons against np.nan will be False.
For example:
In [9]: x2
Out[9]: array([ nan, nan, 0., 1., 2., 3.])
In [10]: x2 > -np.inf
Out[10]: array([False, False, True, True, True, True], dtype=bool)
This makes it easy to directly compare with y:
In [11]: y
Out[11]: array([-2. , -2.2, -2.4, -2.6, -2.8, -3. ])
In [12]: x2 > y
Out[12]: array([False, False, True, True, True, True], dtype=bool)
Examples
As a more complete example:
import numpy as np
def main():
    data = np.array([[43.97, 43.97, 43.75, 43.94],
                     [43.97, 44.25, 43.97, 44.25],
                     [44.22, 44.38, 44.12, 44.34],
                     [44.41, 44.84, 44.38, 44.81],
                     [44.97, 45.09, 44.47, 45.00],
                     [44.97, 45.06, 44.72, 44.97],
                     [44.97, 45.12, 44.91, 44.97]])
    O, H, L, C = data.T
    values = data[C > shift(H, 2)]
    print(values)

def shift(data, amount):
    data = data[:-amount]
    pad = (amount, 0)
    return np.pad(data, pad, mode='constant', constant_values=(np.nan,))

main()
values is then:
[[ 44.22 44.38 44.12 44.34]
[ 44.41 44.84 44.38 44.81]
[ 44.97 45.09 44.47 45. ]
[ 44.97 45.06 44.72 44.97]]
And just to show a pandas version, as well:
import pandas as pd
df = pd.DataFrame([[43.97, 43.97, 43.75, 43.94],
                   [43.97, 44.25, 43.97, 44.25],
                   [44.22, 44.38, 44.12, 44.34],
                   [44.41, 44.84, 44.38, 44.81],
                   [44.97, 45.09, 44.47, 45.00],
                   [44.97, 45.06, 44.72, 44.97],
                   [44.97, 45.12, 44.91, 44.97]],
                  columns=['O', 'H', 'L', 'C'])
values = df[df.C > df.H.shift(2)]
print(values)
Which yields:
O H L C
2 44.22 44.38 44.12 44.34
3 44.41 44.84 44.38 44.81
4 44.97 45.09 44.47 45.00
5 44.97 45.06 44.72 44.97
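Since the goal is ultimately to chain several such comparisons, a small sketch (the second condition is just a hypothetical example) of combining masks in the pandas version:
# hypothetical: C0 > H2 and L0 > L1 (today's low above yesterday's low)
mask = (df.C > df.H.shift(2)) & (df.L > df.L.shift(1))
values = df[mask]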
