How to groupby and count since the first non-NaN value? - python

I have this df:
     CODE  TMAX
0  000130   NaN
1  000130   NaN
2  000130  32.0
3  000130  32.2
4  000130   NaN
5  158328   NaN
6  158328   8.8
7  158328   NaN
8  158328   NaN
9  158328   9.2
..    ...   ...
I want to count the number of non-NaN values and the number of NaN values in the 'TMAX' column, but counting only from the first non-NaN value onwards, and separately for each CODE.
Expected result for code 000130: 2 non-NaN values and 1 NaN value.
Expected result for code 158328: 2 non-NaN values and 2 NaN values.
Same with the other codes...
How can I do this?
Thanks in advance.

If you need the counts per CODE, build a mask of the values since the first non-NaN per group with GroupBy.cummax and count them with crosstab:
m = df.TMAX.notna()
s = m[m.groupby(df['CODE']).cummax()]
df1 = pd.crosstab(df['CODE'], s).rename(columns={True:'non NaNs',False:'NaNs'})
print(df1)
TMAX    NaNs  non NaNs
CODE
130        1         2
158328     2         2
If you also need to explicitly filter the CODE column by the mask:
m = df.TMAX.notna()
mask = m.groupby(df['CODE']).cummax()
df1 = pd.crosstab(df.loc[mask, 'CODE'], m[mask]).rename(columns={True:'non NaNs',False:'NaNs'})
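For reference, a runnable version of the first snippet with the sample data (an illustrative sketch; CODE is built as strings here, which keeps the leading zeros that the int-parsed output above drops):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'CODE': ['000130'] * 5 + ['158328'] * 5,
    'TMAX': [np.nan, np.nan, 32.0, 32.2, np.nan,
             np.nan, 8.8, np.nan, np.nan, 9.2],
})

m = df.TMAX.notna()                    # True where TMAX is present
s = m[m.groupby(df['CODE']).cummax()]  # drop rows before the first non-NaN per CODE
df1 = pd.crosstab(df['CODE'], s).rename(columns={True: 'non NaNs', False: 'NaNs'})
print(df1)
# TMAX    NaNs  non NaNs
# CODE
# 000130     1         2
# 158328     2         2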

Use first_valid_index to find the first non-NaN index and filter. Then use isna to create a boolean mask and count the values.
def countNaN(s):
    return (
        s.loc[s.first_valid_index():]
        .isna()
        .value_counts()
        .rename({True: 'NaN', False: 'notNaN'})
    )

df.groupby('CODE').apply(countNaN)
Output
CODE
000130  notNaN    2
        NaN       1
158328  notNaN    2
        NaN       2
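If you prefer a table like the crosstab above over the MultiIndex Series, unstack the result (a small follow-up sketch; fill_value=0 covers codes that have no NaNs after their first valid value):
out = df.groupby('CODE').apply(countNaN).unstack(fill_value=0)
# one row per CODE, one column per label ('NaN' / 'notNaN')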

Related

Is it possible to clip and fill other values with NaN in pandas?

To keep only the positive values, we can clip a dataframe or specific column(s) of a dataframe using
df.clip(lower=0)
But it replaces all negative values with zero. Is it possible to keep only the non-negative values and replace all others with NaN?
I looked in the pandas documentation, but clip has no fill method.
Another way would be to replace all zeros with NaN, but that would also convert the values which were actually zero.
Use DataFrame.mask with DataFrame.lt or DataFrame.le:
print(df)
   a  b  c
0  2 -6  8
1 -5 -8  0

df1 = df.mask(df.lt(0))
print(df1)
     a   b  c
0  2.0 NaN  8
1  NaN NaN  0

df2 = df.mask(df.le(0))
print(df2)
     a   b    c
0  2.0 NaN  8.0
1  NaN NaN  NaN
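DataFrame.where is the inverse of mask (it keeps values where the condition is True), so the same results can be phrased positively:
df1 = df.where(df.ge(0))  # keep non-negative values, NaN elsewhere
df2 = df.where(df.gt(0))  # keep strictly positive values, NaN elsewhere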

Positional indexing with NA values

I need to index the dataframe by positional index, but I have NA values from a previous operation and I want to preserve them. How could I achieve this?
df1
NaN
1
NaN
NaN
NaN
6

df2
0    10
1    15
2    13
3    15
4    16
5    17
6    17
7    18
8    10

df3
0    15
1    17

The output I want:
NaN
15
NaN
NaN
NaN
17
df2.iloc(df1)
IndexError: indices are out-of-bounds
The .iloc method raises an IndexError here, so I think .iloc is not usable for this. df3 is another output generated with .loc, but I don't know how to add the NaNs back between its values. An answer that builds the output from df1 and df3 is also fine.
If df1 and df2 share the same index, use DataFrame.mask with DataFrame.isna: keep df2's values wherever df1 is not missing and insert NaN everywhere else:
df1 = df2.mask(df1.isna())
print(df1)
    col
0   NaN
1  15.0
2   NaN
3   NaN
4   NaN
5  17.0
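If only df1 and df3 are available, as the question allows, one option is to write df3's values into the non-missing slots of the original df1. A sketch, assuming both are Series and df3 holds exactly one value per non-NaN position of df1, in order:
out = df1.copy()
out[out.notna()] = df3.to_numpy()  # 15 fills position 1, 17 fills position 5
print(out)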

How can I select the rows that have only one non-NaN value and all the rest are NaN?

I am trying to extract from a dataframe the rows that have only one non-NaN element while the rest are NaN.
For example :
     A    B    C
0  NaN  NaN    2
1  NaN    3  NaN
2  NaN    4    5
3  NaN  NaN  NaN
For this example the first two rows should be returned.
I tried this code but it doesn't work:
df_table.isnull(df_table[cols]).all(axis=1)
Thanks!
Use sum instead of all:
df.loc[df.notnull().sum(axis=1) == 1]
To get the non-NaN elements themselves, you can use, for example, max:
df.loc[df.notnull().sum(axis=1) == 1].max(axis=1)
or
df.loc[df.notnull().sum(axis=1) == 1].ffill(axis=1).iloc[:, -1]
which gives:
0    2.0
1    3.0
dtype: float64
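To also record which column the lone value came from, idxmax on the notnull mask can sit alongside max (a sketch reusing the same row filter; assumes pandas imported as pd):
single = df[df.notnull().sum(axis=1) == 1]
result = pd.DataFrame({
    'value': single.max(axis=1),                # the lone non-NaN entry per row
    'column': single.notnull().idxmax(axis=1),  # the column it came from
})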

Pandas: interpolation where first and last data point in column is NaN

I would like to use the interpolate function, but only between known data values in a pandas DataFrame column. The issue is that the first and last values in a column are often NaN, and sometimes it can be many rows before a value is not NaN:
      col 1  col 2
0       NaN    NaN
1       NaN    NaN
...
1000      1    NaN
1001    NaN      1  <-----
1002      3    NaN  <----- only want to fill in these 'in between' rows
1003      4      3
...
3999    NaN    NaN
4000    NaN    NaN
I am tying together a dataset which is updated 'on event' but separately for each column, and is indexed via Timestamp. This means that there are often rows where no data is recorded for some columns, hence a lot of NaNs!
Select the slice of each column between the indices returned by idxmin and idxmax and forward-fill it (note this relies on the smallest and largest values sitting at the first and last valid positions, as they do here; the first_valid_index variant at the end is more general).
print(df)
#       col 1  col 2
# 0       NaN    NaN
# 1       NaN    NaN
# 1000      1    NaN
# 1001    NaN      1
# 1002      3    NaN
# 1003      4      3
# 3999    NaN    NaN
# 4000    NaN    NaN

df.loc[df['col 1'].idxmin(): df['col 1'].idxmax()] = df.loc[df['col 1'].idxmin(): df['col 1'].idxmax()].ffill()
df.loc[df['col 2'].idxmin(): df['col 2'].idxmax()] = df.loc[df['col 2'].idxmin(): df['col 2'].idxmax()].ffill()

print(df)
#       col 1  col 2
# 0       NaN    NaN
# 1       NaN    NaN
# 1000      1    NaN
# 1001      1      1
# 1002      3      1
# 1003      4      3
# 3999    NaN    NaN
# 4000    NaN    NaN
A different solution, thanks to HStro:
col = df['col 1']
df.loc[col.first_valid_index(): col.last_valid_index(), 'col 1'] = (
    col.loc[col.first_valid_index(): col.last_valid_index()].astype(float).interpolate()
)
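The same idea can be wrapped in a helper and applied to every column at once (a sketch; only the slice between the first and last valid index is assigned, so leading and trailing NaNs stay untouched):
def interpolate_inner(s):
    # interpolate only between the first and last valid values of s
    s = s.astype(float)
    lo, hi = s.first_valid_index(), s.last_valid_index()
    if lo is not None:  # skip all-NaN columns
        s.loc[lo:hi] = s.loc[lo:hi].interpolate()
    return s

df = df.apply(interpolate_inner)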

Best way to eliminate columns with only one value from pandas dataframe

I'm trying to build a function to eliminate from my dataset the columns that contain only one value. I used this function:
def oneCatElimination(dataframe):
    columns = dataframe.columns.values
    for column in columns:
        if len(dataframe[column].value_counts().unique()) == 1:
            del dataframe[column]
    return dataframe
The problem is that the function also eliminates columns with more than one distinct value, e.g. an index column of integers (value_counts().unique() returns the distinct counts, so any column whose values all appear the same number of times gets flagged).
Just
df.dropna(thresh=2, axis=1)
will work. No need for anything else. It keeps all columns with 2 or more non-NA values (controlled by the value passed to thresh). The axis kwarg lets you work with rows or columns; it is rows by default, so you need to pass axis=1 explicitly to work on columns. See dropna() for more information.
A couple of assumptions went into this:
Null/NA values don't count
You need multiple non-NA values to keep a column
Those values need to be different in some way (e.g., a column full of 1's and only 1's should be dropped)
All that said, I would use a select statement on the columns.
If you start with this dataframe:
import pandas
df = pandas.DataFrame(index=range(10), columns=list('ABCD'))
df.loc[2, 'A'] = 23
df.loc[3, 'B'] = 52
df.loc[4, 'B'] = 36
df.loc[5, 'C'] = 11
df.loc[6, 'C'] = 11
df.loc[7, 'D'] = 43
df.loc[8, 'D'] = 63
df.loc[9, 'D'] = 97
df
Which creates:
     A    B    C    D
0  NaN  NaN  NaN  NaN
1  NaN  NaN  NaN  NaN
2   23  NaN  NaN  NaN
3  NaN   52  NaN  NaN
4  NaN   36  NaN  NaN
5  NaN  NaN   11  NaN
6  NaN  NaN   11  NaN
7  NaN  NaN  NaN   43
8  NaN  NaN  NaN   63
9  NaN  NaN  NaN   97
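For comparison, the dropna one-liner from the top of this answer would keep C here, since thresh only counts non-NA values and does not check whether they differ:
df.dropna(thresh=2, axis=1)  # keeps B, C, D; drops only A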
Given my assumptions above, columns A and C should be dropped since A only has one value and both of C's values are the same. You can then do:
df.select(lambda c: df[c].dropna().unique().shape[0] > 1, axis=1)
And that gives me:
     B    D
0  NaN  NaN
1  NaN  NaN
2  NaN  NaN
3   52  NaN
4   36  NaN
5  NaN  NaN
6  NaN  NaN
7  NaN   43
8  NaN   63
9  NaN   97
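Note that DataFrame.select has since been deprecated and removed from pandas; on current versions the same filter can be written with loc and nunique:
df.loc[:, [c for c in df.columns if df[c].nunique() > 1]]  # nunique ignores NaN by default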
This will work for both text and numbers:
for col in dataframe:
    if len(dataframe.loc[:, col].unique()) == 1:
        dataframe.pop(col)
Note: This will remove the columns having only one value from the original dataframe.
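A vectorized variant of the same check that returns a new frame instead of mutating in place (dropna=False makes an all-NaN column count as single-valued too, matching unique() above, which includes NaN):
dataframe = dataframe.loc[:, dataframe.nunique(dropna=False) > 1]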
