How to get the column index that meets some condition in pandas? - python

I have the following:
x = pd.DataFrame({'a':[1,5,5], 'b':[7,0,7]})
And for every row, I want to get the index of the first column whose value is greater than some value, let's say greater than 4.
In this example, the answer is 1 (corresponding to the value 7 in the first row), 0 (corresponding to the value 5 in the second row), and 0 (corresponding to the value 5 in the third row).
Which means the answer is [1, 0, 0].
I tried it with the apply method:
def get_values_from_row(row, th=0.9):
    """Get the first column whose value is larger than a threshold.

    Args:
        row (pd.Series): a row.
        th (float): the threshold.

    Returns:
        The index label of the first column whose value meets the condition.
    """
    return row[row > th].index.tolist()[0]
It works, but I have a large dataset and it's quite slow.
What is a better alternative?

I think you need first_valid_index with get_loc:
print (x[x > 4])
     a    b
0  NaN  7.0
1  5.0  NaN
2  5.0  7.0
print (x[x > 4].apply(lambda x: x.index.get_loc(x.first_valid_index()), axis=1))
0 1
1 0
2 0
dtype: int64
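For a large DataFrame, a fully vectorized sketch (an alternative, not part of the answer above) is to build the boolean mask once and take the positional argmax of each row. Note that argmax also returns 0 for rows with no value above the threshold, so such rows need separate handling if they can occur:
import pandas as pd

x = pd.DataFrame({'a': [1, 5, 5], 'b': [7, 0, 7]})

# Boolean mask of values above the threshold; argmax returns the position of
# the first True in each row
mask = (x > 4).to_numpy()
print(mask.argmax(axis=1))  # [1 0 0]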

Related

How would I write a pandas apply lambda function that compares the value to the same index value in an outside list?

Say I have a simple dataframe and also a list of numbers:
df = pd.DataFrame({'column1':[1,2,3,4,5]})
list_of_nums = [0,3,2,5,6]
I want to create 'column2' in df based on whether the respective row value is greater or less than the value at the same position in list_of_nums.
column2 would be:
['Greater','Less','Greater','Less','Less']
I was trying something like this:
def compare_to_list(row):
    if row > list_of_nums[row.index]:
        return 'Greater'
    else:
        return 'Less'

df['column1'].apply(lambda x: compare_to_list(x))
However it's not able to access the row index in order to index the corresponding value in the list.
You can use np.where() to compare the column with the list and assign Greater / Less accordingly, as follows:
df['column2'] = np.where(df['column1'] > list_of_nums, 'Greater', 'Less')
Result:
print(df)
   column1  column2
0        1  Greater
1        2     Less
2        3  Greater
3        4     Less
4        5     Less
Or try pd.Series() with the map() method:
df['column2'] = ((df['column1'] > pd.Series(list_of_nums))
                 .map({True: 'Greater', False: 'Less'}))
Output of df:
   column1  column2
0        1  Greater
1        2     Less
2        3  Greater
3        4     Less
4        5     Less
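If you specifically want the apply-based approach from the question to work, here is one possible sketch (assuming the default RangeIndex, so that the row label doubles as the position in list_of_nums). It applies across whole rows and uses row.name to look up the matching element:
import pandas as pd

df = pd.DataFrame({'column1': [1, 2, 3, 4, 5]})
list_of_nums = [0, 3, 2, 5, 6]

def compare_to_list(row):
    # row.name is the row label, which with the default RangeIndex is also
    # the position of the matching element in list_of_nums
    return 'Greater' if row['column1'] > list_of_nums[row.name] else 'Less'

df['column2'] = df.apply(compare_to_list, axis=1)
print(df['column2'].tolist())  # ['Greater', 'Less', 'Greater', 'Less', 'Less']
This is much slower than the vectorized np.where / map answers above, but it shows why the original attempt failed: a scalar passed to apply on a single column has no way of knowing its own index.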

Groupby when given the start positional index of each group

I have one series of values that I would like to group, and another series containing the starting positional index of each group after the first (the first group is understood to begin at positional index 0). The series of values can have an arbitrary index. Is there a way to use this to produce a groupby-aggregate? Ideally empty groups would be preserved. Example:
values = pd.Series(np.arange(10, 20), index=np.arange(110, 120))
group_indices = pd.Series([3, 3, 8])
Now, values.groupby(group_indices) should be grouped so that the first group is values.iloc[:3], the second is values.iloc[3:3] (an empty group), the third is values.iloc[3:8], and the fourth is values.iloc[8:], and values.groupby(group_indices).mean() would be pd.Series([11.0, NaN, 15.0, 18.5]).
Here is an easy way (note that this matches group_indices against the index labels, so it assumes values has the default 0-based index):
values.groupby(values.index.isin(group_indices).cumsum()).mean()
Out[454]:
1 11.0
2 15.0
3 18.5
dtype: float64
Straightforwardly with the numpy.split routine:
In [1286]: values = pd.Series(np.arange(10, 20))
In [1287]: group_indices = pd.Series([0, 3, 8])
In [1288]: pd.Series([s.mean() for s in np.split(values, group_indices) if s.size])
Out[1288]:
0 11.0
1 15.0
2 18.5
dtype: float64
To account "empty" group - just remove if s.size check:
In [1304]: group_indices = pd.Series([3, 3, 8])
In [1305]: pd.Series([s.mean() for s in np.split(values, group_indices)])
Out[1305]:
0 11.0
1 NaN
2 15.0
3 18.5
dtype: float64
Given your update, here's an odd way to do this with pd.merge_asof. Some care needs to be taken to deal with the first group that's from 0 to your first index in the Series.
import pandas as pd
import numpy as np
(pd.merge_asof(values.to_frame('val'),
               values.iloc[np.r_[group_indices]].reset_index().reset_index().drop(columns=0),
               left_index=True, right_on='index',
               direction='backward')
   .fillna({'level_0': -1})  # Because your first group is 0: first index
   .groupby('level_0').val.mean()
   .reindex([-1] + [*range(len(group_indices))])  # Get 0 size groups in output
)
level_0
-1 11.0
0 NaN
1 15.0
2 18.5
Name: val, dtype: float64
Let's change group_indices a bit, so that the group names (1, 2, 3) are visible:
group_indices = pd.Series([1,2,3],index=[0, 3, 8])
then
values.groupby(group_indices.reindex(values.index,method='ffill')).mean()
would give you what you want.
Note that group_indices.reindex(values.index,method='ffill') gives you
0 1
1 1
2 1
3 2
4 2
5 2
6 2
7 2
8 3
9 3
which assigns each row of values with a group number.
My solution involves keeping the inputs as they are and doing some ugly adjustments:
pd.DataFrame(values).assign(group=pd.cut(pd.DataFrame(values).index,
                                         [-1, 2, 7, np.inf],
                                         labels=[0, 1, 2])).groupby('group').mean()
Output
          0
group
0      11.0
1      15.0
2      18.5
Thanks to all the answers, especially WeNYoBen's. The following will produce the correct groups and skip over empty groups.
# First, add the final index to `group_indices` so that
# we have a series of right endpoints, or interval upper bounds
upper_bounds = group_indices.append(pd.Series(values.shape[0]), ignore_index=True)
# Compute indices of nonempty groups
lower_bounds = upper_bounds.shift(fill_value=0)
nonempty_group_idxs = upper_bounds != lower_bounds
# Get means indexed from 0 to n_nonempty_groups-1
means = values.groupby(pd.RangeIndex(values.shape[0]).isin(upper_bounds).cumsum()).mean()
# Reassign index for the correct groups
means.index = nonempty_group_idxs.index[nonempty_group_idxs]
This will have a noncontiguous index, with skipped elements corresponding to empty groups in the original groupby. If you want to place NaN in those spots, you can do
means = means.reindex(index=pd.RangeIndex(group_indices.shape[0] + 1))  # one group per start index, plus the initial group
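Another sketch (not one of the original answers) that works with an arbitrary index and preserves empty groups is to build positional group labels with np.searchsorted and group by that label array:
import numpy as np
import pandas as pd

values = pd.Series(np.arange(10, 20), index=np.arange(110, 120))
group_indices = pd.Series([3, 3, 8])

# For each position, count how many group start indices are <= that position;
# that count is the group number (group 0 starts at position 0).
labels = np.searchsorted(group_indices.to_numpy(), np.arange(len(values)), side='right')

result = values.groupby(labels).mean().reindex(range(len(group_indices) + 1))
print(result)
# 0    11.0
# 1     NaN
# 2    15.0
# 3    18.5
# dtype: float64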

Extracting Max value along each row from strings in column

I have a column of strings in a DataFrame which contains comma-separated numbers. I need to extract the maximum value along each row from the strings. The maximum returned should only consider the first 13 values in each string.
I've tried splitting the string using ',' as a separator to convert it into columns with the expand option enabled. Then I'm using the assign method of pandas to find the max value along each row.
sample_dt1 = sample_dt['pyt_hist'].str.split(',', expand=True).astype(float)
sample_dt = sample_dt.assign(max_value=sample_dt1.max(axis=1))
Sample Data:
index pyt_hist
0 0,0,0,0,0,0,0,0,0,0,0
1 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2 0,0,0,360,420,392,361,330,300,269,239,208,177
3 0,0,0,0,0,0,0,0,0,0,0,0,0,0,23,0,23,0,0,56,0
Expected Result:
index pyt_hist max_value
0 0,0,0,0,0,0,0,0,0,0,0 0
1 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 0
2 0,0,0,360,420,392,361,330,300,269,239,208,177 420
3 0,0,0,0,0,0,0,0,0,0,0,0,0,0,23,0,23,0,0,56,0 0
Results obtained using my code:
index pyt_hist max_value
0 0,0,0,0,0,0,0,0,0,0,0 0.0
1 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 0.0
2 0,0,0,360,420,392,361,330,300,269,239,208,177 420.0
3 0,0,0,0,0,0,0,0,0,0,0,0,0,0,23,0,23,0,0,56,0 56.0
You are very close: sample_dt1.iloc[:, :13] gives you the first 13 columns of sample_dt1, so you can do:
sample_dt = sample_dt.assign(max_value=sample_dt1.iloc[:,:13].max(axis=1))
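For reference, here is a self-contained sketch of that fix, with the sample data rebuilt from the question:
import pandas as pd

sample_dt = pd.DataFrame({'pyt_hist': [
    '0,0,0,0,0,0,0,0,0,0,0',
    '0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0',
    '0,0,0,360,420,392,361,330,300,269,239,208,177',
    '0,0,0,0,0,0,0,0,0,0,0,0,0,0,23,0,23,0,0,56,0']})

# Split into one column per number, keep only the first 13 positions,
# then take the row-wise maximum
sample_dt1 = sample_dt['pyt_hist'].str.split(',', expand=True).astype(float)
sample_dt = sample_dt.assign(max_value=sample_dt1.iloc[:, :13].max(axis=1))
print(sample_dt['max_value'].tolist())  # [0.0, 0.0, 420.0, 0.0]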
df.pyt_hist.str.split(',').apply(lambda x: max([int(i) for i in x[:13] if i]))
Output
0 0
1 0
2 420
3 0
Name: pyt_hist, dtype: int64

How to access values returned by the column names idxmin/idxmax?

Let's say I have this dataframe
> df = pd.DataFrame({'A': [1,5], 'B':[3,4]})
   A  B
0  1  3
1  5  4
I can get the minimum value of each row with:
> df.min(1)
0 1
1 4
dtype: int64
Or its indexes with:
> df.idxmin(1)
0 A
1 B
dtype: object
Nevertheless, this implies searching the minimum values twice. Is there a way to use the idxmin results to access the respective columns and get the minimum value (without calling min)?
Edit: I am looking for something that is faster than calling min again. In theory, this should be possible as columns are indexed.
To get the values in a list, you could do the following:
> indices = df.idxmin(1)
> [df.iloc[k][indices[k]] for k in range(len(indices))]
[1, 4]
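A vectorized alternative (a sketch, not part of the answer above) is to translate the idxmin labels into column positions once and index the underlying NumPy array directly, so the minima are never searched for a second time:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 5], 'B': [3, 4]})

idx = df.idxmin(axis=1)                   # column label of each row's minimum
col_pos = df.columns.get_indexer(idx)     # translate labels to column positions
min_vals = df.to_numpy()[np.arange(len(df)), col_pos]
print(min_vals)  # [1 4]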

Index a DataFrame with a list and return NaN for out-of-bounds indices in Pandas?

Normally when I index a DataFrame (or a Series) with a list of integer indices, I get back a subset of the rows, unless some of my indices are out of bounds, in which case I get an IndexError:
s = pd.Series(range(4))
0 0
1 1
2 2
3 3
s.iloc[[1,3]]
1 1
3 3
s.iloc[[1,3,5]]
IndexError
But I'd like to get back a DataFrame (or Series) having an index identical to the list I queried with (i.e., parallel to the query list), with (the rows corresponding to) any out-of-bounds indices filled in with NaN:
s.something[[1,3,5]]
1 1
3 3
5 NaN
I don't think join tricks work because those want to operate on the DataFrame index (or columns). As far as I can tell there's not even an "iget" integer-based get method if I wanted to manually loop over the indices myself. That leaves something like:
indices = [1,3,5]
pd.Series([s.iloc[i] if 0 <= i < len(s) else np.nan for i in indices], index=indices)
Is that the best Pandas 0.18 can do?
You can use reindex to achieve this:
In [119]:
s.reindex([1,3,5])
Out[119]:
1 1
3 3
5 NaN
dtype: float64
This will use the passed index and return existing values, with NaN for missing labels.
Thanks to @EdChum for inspiration, the general solution is:
s.reset_index(drop=True).reindex([1,3,5])
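As a quick sketch of why the reset_index step matters, here is the same trick on a Series with a made-up non-default index, where a plain reindex would no longer select by position:
import pandas as pd

s2 = pd.Series(range(4), index=['w', 'x', 'y', 'z'])

# reset_index(drop=True) relabels the rows 0..len(s2)-1, so the reindex
# that follows selects by position and fills missing positions with NaN
print(s2.reset_index(drop=True).reindex([1, 3, 5]))
# 1    1.0
# 3    3.0
# 5    NaN
# dtype: float64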
