Getting all rows with NaN value - python

I have a table with a column that has some NaN values in it:
A  B  C    D
2  3  2  NaN
3  4  5    5
2  3  1  NaN
I'd like to get all rows where D is NaN. How can I do this?

Creating a df for illustration (containing NaN)
In [86]: df = pd.DataFrame({'a':[1,2,3],'b':[3,4,5],'c':[np.nan, 4,5]})
In [87]: df
Out[87]:
   a  b    c
0  1  3  NaN
1  2  4  4.0
2  3  5  5.0
Checking which indices have null for column c
In [88]: pd.isnull(df['c'])
Out[88]:
0     True
1    False
2    False
Name: c, dtype: bool
Checking which indices don't have null for column c
In [90]: pd.notnull(df['c'])
Out[90]:
0    False
1     True
2     True
Name: c, dtype: bool
Selecting rows of df where c is not null
In [91]: df[pd.notnull(df['c'])]
Out[91]:
   a  b    c
1  2  4  4.0
2  3  5  5.0
Selecting rows of df where c is null
In [93]: df[pd.isnull(df['c'])]
Out[93]:
   a  b    c
0  1  3  NaN
Selecting rows of column c of df where c is not null
In [94]: df['c'][pd.notnull(df['c'])]
Out[94]:
1    4.0
2    5.0
Name: c, dtype: float64
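As a side note, newer pandas versions (0.21+) expose the same checks as methods, so the selections above can also be written with .isna() / .notna(). A minimal self-contained sketch on the same df:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 4, 5], 'c': [np.nan, 4, 5]})
df[df['c'].isna()]    # rows where c is NaN
df[df['c'].notna()]   # rows where c is not NaN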

For a solution that doesn't involve pandas, you can do something like:
goodind = np.where(np.sum(np.isnan(y), axis=1) == 0)[0]  # indices of rows not containing NaNs
(or the negation if you want rows with NaNs) and use the indices to slice the data. Note that np.any and np.all do accept an axis parameter, so np.isnan(y).any(axis=1) is an equivalent and arguably clearer way to combine the booleans than summing them.
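For illustration, a small sketch with a made-up array y (hypothetical data, just to show the mechanics):
import numpy as np

y = np.array([[2., 3., 2., np.nan],
              [3., 4., 5., 5.],
              [2., 3., 1., np.nan]])

mask = np.isnan(y).any(axis=1)   # True for rows containing at least one NaN
rows_without_nan = y[~mask]      # slice out the clean rows
rows_with_nan = y[mask]          # or keep the rows that do contain NaNs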

Related

How to select values from a dataframe by a series of column names?

I have a dataframe df:
   A  B
0  1  4
1  2  5
2  3  6
And a series s:
0    A
1    B
2    A
Now I want to pick values from df with column names specified in s. The expected result is:
0    1    <- from column A
1    5    <- from column B
2    3    <- from column A
How can I get this done efficiently?
Use Index.get_indexer to map the Series values to column positions, then select the values by NumPy indexing into the 2d array:
a = df.to_numpy()
b = a[np.arange(len(df)), df.columns.get_indexer(s)]
print(b)
[1 5 3]
s1 = pd.Series(b, index=s.index)
print(s1)
0    1
1    5
2    3
dtype: int64
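To make the mechanics explicit, here is a self-contained version of the same idea (df and s are rebuilt from the question; the variable names are mine):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
s = pd.Series(['A', 'B', 'A'])

col_pos = df.columns.get_indexer(s)   # [0, 1, 0]: a column position for each row
row_pos = np.arange(len(df))          # [0, 1, 2]: one row position per row
result = pd.Series(df.to_numpy()[row_pos, col_pos], index=s.index)  # 1, 5, 3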

Pandas Series with column names for each value above a minimum

I'm trying to get a new Series from a DataFrame. For each row of the DataFrame, this Series should contain the name of the leftmost column whose value is at or above some minimum, like this:
df = pd.DataFrame(np.random.randint(0,10,size=(5, 6)), columns=list('ABCDEF'))
>>> df
   A  B  C  D  E  F
0  2  4  6  8  8  4
1  2  0  9  7  7  1
2  1  7  7  7  3  0
3  5  4  4  0  1  7
4  9  6  1  5  1  5
min = 3
Expected Output:
0    B
1    C
2    B
3    A
4    A
dtype: object
Here the output's row 0 is "B" because, in row index 0 of the DataFrame, column "B" is the leftmost column whose value is greater than or equal to min = 3.
I know that I can use df.idxmin(axis=1) to get the column names of the minimum for each row, but I have no clue at all how to tackle this more complex problem.
Thanks for any help or hints!
UPDATE - index of the first element in each row satisfying the condition:
A more elegant and more efficient version from @DSM:
In [156]: (df >= 3).idxmax(1)
Out[156]:
0    B
1    C
2    B
3    A
4    A
dtype: object
my version:
In [149]: df[df >= 3].apply(lambda x: x.first_valid_index(), axis=1)
Out[149]:
0    B
1    C
2    B
3    A
4    A
dtype: object
Old answer - index of the minimum element for each row:
In [27]: df[df >= 3].idxmin(1)
Out[27]:
0    E
1    A
2    C
3    C
4    F
dtype: object
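One caveat worth adding: (df >= 3).idxmax(1) returns the first column label even for a row in which nothing meets the condition, because idxmax on an all-False row simply picks the first position. A small sketch (my addition, not part of the original answer) of guarding against that:
mask = df >= 3
result = mask.idxmax(axis=1)
result[~mask.any(axis=1)] = None   # blank out rows where no value reaches the threshold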

Select columns in pandas dataframe by value in rows

I have a pandas.DataFrame with too many columns. I want to select all columns whose row values are only 0 and 1. All of the columns are int64, so I can't pick them out by dtype. How can I do this?
IIUC then you can use isin and filter the columns:
In [169]:
df = pd.DataFrame({'a':[0,1,1,0], 'b':list('abcd'), 'c':[1,2,3,4]})
df
Out[169]:
   a  b  c
0  0  a  1
1  1  b  2
2  1  c  3
3  0  d  4
In [174]:
df[df.columns[df.isin([0,1]).all()]]
Out[174]:
   a
0  0
1  1
2  1
3  0
The output from the inner condition:
In [175]:
df.isin([0,1]).all()
Out[175]:
a     True
b    False
c    False
dtype: bool
We can use the boolean mask to filter the columns:
In [176]:
df.columns[df.isin([0,1]).all()]
Out[176]:
Index(['a'], dtype='object')
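Equivalently, the boolean mask can be passed straight to .loc on the column axis, without going through df.columns first. A minimal sketch on the same df:
df.loc[:, df.isin([0, 1]).all()]   # keeps only the columns whose values are all 0 or 1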

pandas - Going from aggregated format to long format

If I wanted to go from a long format to a grouped, aggregated format, I would simply do:
s = pd.DataFrame(['a','a','a','a','b','b','c'], columns=['value'])
s.groupby('value').size()
value
a    4
b    2
c    1
dtype: int64
Now if I wanted to reverse that aggregation and go from the grouped format back to the long format, how would I go about doing that? I guess I could loop through the grouped series and repeat 'a' 4 times, 'b' 2 times, etc.
Is there a better way to do this in pandas or any other Python package?
Thanks for any hints!
Perhaps .transform can help with this:
s.set_index('value', drop=False, inplace=True)
s['size'] = s.groupby(level='value')['value'].transform('size')
s.reset_index(inplace=True, drop=True)
s
yielding:
  value  size
0     a     4
1     a     4
2     a     4
3     a     4
4     b     2
5     b     2
6     c     1
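As a side note, the same column can be added without the set_index round trip, since transform aligns on the original index by itself. A minimal sketch, starting from the original one-column s:
s['size'] = s.groupby('value')['value'].transform('size')   # 4, 4, 4, 4, 2, 2, 1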
Another, rather simple approach is to use np.repeat (assuming s2 is the aggregated series):
In [17]: np.repeat(s2.index.values, s2.values)
Out[17]: array(['a', 'a', 'a', 'a', 'b', 'b', 'c'], dtype=object)
In [18]: pd.DataFrame(np.repeat(s2.index.values, s2.values), columns=['value'])
Out[18]:
  value
0     a
1     a
2     a
3     a
4     b
5     b
6     c
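The same round trip can also be done without dropping to NumPy, since a pandas Index has a .repeat method. A small sketch, again assuming s2 is the aggregated series:
pd.DataFrame({'value': s2.index.repeat(s2.values)})   # same frame as Out[18] above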
There might be something cleaner, but here's an approach. First, store your groupby results in a DataFrame and rename the columns.
agg = s.groupby('value').size().reset_index()
agg.columns = ['key', 'count']
Then, build a frame with columns that track the count for each letter.
counts = agg['count'].apply(lambda x: pd.Series([0] * x))
counts['key'] = agg['key']
In [107]: counts
Out[107]:
   0    1    2    3 key
0  0    0    0    0   a
1  0    0  NaN  NaN   b
2  0  NaN  NaN  NaN   c
Finally, this can be melted and the nulls dropped to get your desired frame.
In [108]: pd.melt(counts, id_vars='key').dropna()[['key']]
Out[108]:
  key
0   a
1   b
2   c
3   a
4   b
6   a
9   a

Pandas: Selection with MultiIndex

Considering the following DataFrames
In [136]:
df = pd.DataFrame({'A':[1,1,2,2],'B':[1,2,1,2],'C':np.arange(10,30,5)}).set_index(['A','B'])
df
Out[136]:
      C
A B
1 1  10
  2  15
2 1  20
  2  25
In [130]:
vals = pd.DataFrame({'A':[1,2],'values':[True,False]}).set_index('A')
vals
Out[130]:
  values
A
1   True
2  False
How can I select only the rows of df with corresponding True values in vals?
If I reset_index on both frames I can now merge/join them and slice however I want, but how can I do it using the (multi)indexes?
boolean indexing all the way...
In [65]: df[df.index.get_level_values('A').isin(vals[vals['values']].index)]
Out[65]:
      C
A B
1 1  10
  2  15
(Index.isin returns a plain boolean array, so there is no need to wrap the level values in a Series; doing so can misalign against the MultiIndex.)
Note that you can use xs on a multiindex.
In [66]: df.xs(1)
Out[66]:
    C
B
1  10
2  15
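Another option (my sketch, not from the original answers) is to look up the index labels that are True in vals and hand them to .loc, which selects on the first level of the MultiIndex:
keep = vals.index[vals['values']]   # the 'A' labels to keep, here just [1]
df.loc[keep]                        # all (A, B) rows whose A level is in keep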
