Say I have a simple dataframe and also a list of numbers:
df = pd.DataFrame({'column1':[1,2,3,4,5]})
list_of_nums = [0,3,2,5,6]
I want to create 'column2' in df based on whether each row's value is greater than or less than the value at the corresponding position in list_of_nums.
column2 would be:
['Greater','Less','Greater','Less','Less']
I was trying something like this:
def compare_to_list(row):
    if row > list_of_nums[row.index]:
        return 'Greater'
    else:
        return 'Less'
df['column1'].apply(lambda x: compare_to_list(x))
However, it's not able to access the row index in order to look up the corresponding value in the list.
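(For reference, a plain comprehension over the zipped pairs sidesteps the index lookup entirely; a minimal sketch:)
df['column2'] = ['Greater' if v > n else 'Less'
                 for v, n in zip(df['column1'], list_of_nums)]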
You can use np.where() to compare the column with the list and assign 'Greater' / 'Less' accordingly, as follows:
import numpy as np
df['column2'] = np.where(df['column1'] > list_of_nums, 'Greater', 'Less')
Result:
print(df)
column1 column2
0 1 Greater
1 2 Less
2 3 Greater
3 4 Less
4 5 Less
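Note that this comparison is purely positional, so list_of_nums must be the same length as the column; the boolean mask np.where receives is equivalent to this sketch:
df['column1'].to_numpy() > np.array(list_of_nums)
# array([ True, False,  True, False, False])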
Try the pd.Series() constructor and the map() method:
df['column2'] = ((df['column1'] > pd.Series(list_of_nums))
                 .map({True: 'Greater', False: 'Less'}))
Output of df:
column1 column2
0 1 Greater
1 2 Less
2 3 Greater
3 4 Less
4 5 Less
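One caveat: unlike the list comparison above, comparing two Series aligns on index labels, so this works because both sides share the default RangeIndex. A sketch of the intermediate boolean Series:
mask = df['column1'] > pd.Series(list_of_nums)
print(mask)
# 0     True
# 1    False
# 2     True
# 3    False
# 4    False
# dtype: bool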
I have one series of values that I would like to group, and another series containing the starting positional index of each group after the first (the first group is understood to begin at positional index 0). The series of values can have an arbitrary index. Is there a way to use this to produce a groupby-aggregate? Ideally empty groups would be preserved. Example:
values = pd.Series(np.arange(10, 20), index=np.arange(110, 120))
group_indices = pd.Series([3, 3, 8])
Now, the grouping should be such that the first group is values.iloc[:3], the second is values.iloc[3:3] (an empty group), the third is values.iloc[3:8], and the fourth is values.iloc[8:], so that values.groupby(group_indices).mean() would be pd.Series([11.0, NaN, 15.0, 18.5]).
Here is an easy way:
values.groupby(values.index.isin(group_indices).cumsum()).mean()
Out[454]:
1 11.0
2 15.0
3 18.5
dtype: float64
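Note that the output above corresponds to the original form of the question (values with a default RangeIndex and group_indices of [0, 3, 8], as used in the answers below). Since isin matches index labels, the updated example with an arbitrary index needs positional labels instead; a sketch of the same cumsum trick, which likewise drops the empty group:
keys = pd.RangeIndex(len(values)).isin(group_indices).cumsum()
values.groupby(keys).mean()
# 0    11.0
# 1    15.0
# 2    18.5
# dtype: float64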
Straightforwardly with the numpy.split routine:
In [1286]: values = pd.Series(np.arange(10, 20))
In [1287]: group_indices = pd.Series([0, 3, 8])
In [1288]: pd.Series([s.mean() for s in np.split(values, group_indices) if s.size])
Out[1288]:
0 11.0
1 15.0
2 18.5
dtype: float64
To account for the empty group, just remove the if s.size check:
In [1304]: group_indices = pd.Series([3, 3, 8])
In [1305]: pd.Series([s.mean() for s in np.split(values, group_indices)])
Out[1305]:
0 11.0
1 NaN
2 15.0
3 18.5
dtype: float64
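Since np.split slices positionally, the arbitrary index in the updated example is no obstacle either; a sketch on the raw values, where the size check sidesteps the mean-of-empty-slice warning:
chunks = np.split(values.to_numpy(), group_indices)
pd.Series([c.mean() if c.size else np.nan for c in chunks])
# 0    11.0
# 1     NaN
# 2    15.0
# 3    18.5
# dtype: float64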
Given your update, here's an odd way to do this with pd.merge_asof. Some care needs to be taken to deal with the first group, which runs from position 0 up to the first index in the Series.
import pandas as pd
import numpy as np
# Build a right table of group-start rows: 'level_0' is the group number,
# 'index' is the label in values where that group starts.
starts = (values.iloc[np.r_[group_indices]]
                .reset_index()        # group-start labels -> column 'index'
                .reset_index()        # group numbers      -> column 'level_0'
                .drop(columns=0))     # drop the values themselves
(pd.merge_asof(values.to_frame('val'), starts,
               left_index=True, right_on='index',
               direction='backward')  # match each row to the last group start at or before it
   .fillna({'level_0': -1})           # rows before the first start form group -1
   .groupby('level_0').val.mean()
   .reindex([-1] + [*range(len(group_indices))])  # reinstate 0-size groups as NaN
)
level_0
-1 11.0
0 NaN
1 15.0
2 18.5
Name: val, dtype: float64
Let's change group_indices a bit, so that the group names (1, 2, 3) are visible:
group_indices = pd.Series([1,2,3],index=[0, 3, 8])
then
values.groupby(group_indices.reindex(values.index, method='ffill')).mean()
would give you what you want (for the original example, where values has the default RangeIndex).
Note that group_indices.reindex(values.index, method='ffill') gives you
0 1
1 1
2 1
3 2
4 2
5 2
6 2
7 2
8 3
9 3
which assigns each row of values a group number.
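The same row-to-group mapping can also be computed positionally with np.searchsorted, which handles the updated example (arbitrary index, duplicate boundaries) and restores empty groups via a reindex; a sketch:
labels = np.searchsorted(group_indices, np.arange(len(values)), side='right')
values.groupby(labels).mean().reindex(range(len(group_indices) + 1))
# 0    11.0
# 1     NaN
# 2    15.0
# 3    18.5
# dtype: float64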
My solution involves keeping the inputs as they are and doing some ugly adjustments; note that the hard-coded bin edges encode the original example's boundaries [0, 3, 8] on a default RangeIndex:
pd.DataFrame(values).assign(group=pd.cut(pd.DataFrame(values).index,
                                         [-1, 2, 7, np.inf], labels=[0, 1, 2])).groupby('group').mean()
Output
0
group
0 11.0
1 15.0
2 18.5
Thanks to all the answers, especially WeNYoBen's. The following will produce the correct groups and skip over empty groups.
# First, add the final index to `group_indices` so that
# we have a series of right endpoints, or interval upper bounds
upper_bounds = group_indices.append(pd.Series(values.shape[0]), ignore_index=True)
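# Note: Series.append was removed in pandas 2.0; on newer versions the
# pd.concat equivalent would be:
# upper_bounds = pd.concat([group_indices, pd.Series([values.shape[0]])],
#                          ignore_index=True)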
# Compute indices of nonempty groups
lower_bounds = upper_bounds.shift(fill_value=0)
nonempty_group_idxs = upper_bounds != lower_bounds
# Get means indexed from 0 to n_nonempty_groups-1
means = values.groupby(pd.RangeIndex(values.shape[0]).isin(upper_bounds).cumsum()).mean()
# Reassign index for the correct groups
means.index = nonempty_group_idxs.index[nonempty_group_idxs]
This will have a noncontiguous index, with skipped elements corresponding to empty groups in the original groupby. If you want to place NaN in those spots, you can reindex over all len(group_indices) + 1 group labels (the extra one being the first group, which starts at 0):
means = means.reindex(index=pd.RangeIndex(group_indices.shape[0] + 1))
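For the example in the question, this should reproduce the target output:
means
# 0    11.0
# 1     NaN
# 2    15.0
# 3    18.5
# dtype: float64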
I have a column of strings in a DataFrame containing comma-separated numbers, and I need to extract the maximum value in each row, considering only the first 13 values in each string.
I've tried splitting the string on ',' with the expand option enabled to convert it into columns. Then I'm using the assign method of Pandas to find the max value along each row (axis=1).
sample_dt1 = sample_dt['pyt_hist'].str.split(',', expand=True).astype(float)
sample_dt = sample_dt.assign(max_value=sample_dt1.max(axis=1))
Sample Data:
index pyt_hist
0 0,0,0,0,0,0,0,0,0,0,0
1 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2 0,0,0,360,420,392,361,330,300,269,239,208,177
3 0,0,0,0,0,0,0,0,0,0,0,0,0,0,23,0,23,0,0,56,0
Expected Result:
index pyt_hist max_value
0 0,0,0,0,0,0,0,0,0,0,0 0
1 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 0
2 0,0,0,360,420,392,361,330,300,269,239,208,177 420
3 0,0,0,0,0,0,0,0,0,0,0,0,0,0,23,0,23,0,0,56,0 0
Results obtained using my code:
index pyt_hist max_value
0 0,0,0,0,0,0,0,0,0,0,0 0.0
1 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 0.0
2 0,0,0,360,420,392,361,330,300,269,239,208,177 420.0
3 0,0,0,0,0,0,0,0,0,0,0,0,0,0,23,0,23,0,0,56,0 56.0
You are very close: sample_dt1.iloc[:, :13] gives you the first 13 columns of sample_dt1, so you can do:
sample_dt = sample_dt.assign(max_value=sample_dt1.iloc[:,:13].max(axis=1))
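As an aside, the expected result shows integers while the split-based max comes back as float (the expand=True split pads short rows with missing values, forcing a float dtype). If the integer dtype matters, one way is to cast at the end, which is safe here because every row has at least one value among the first 13 columns:
sample_dt['max_value'] = sample_dt1.iloc[:, :13].max(axis=1).astype(int)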
Alternatively, split and take the max over the first 13 values directly; the if i check skips any empty strings produced by the split:
df.pyt_hist.str.split(',').apply(lambda x: max([int(i) for i in x[:13] if i]))
Output
0 0
1 0
2 420
3 0
Name: pyt_hist, dtype: int64
Normally when I index a DataFrame (or a Series) with a list of integer indices, I get back a subset of the rows, unless some of my indices are out of bounds, in which case I get an IndexError:
s = pd.Series(range(4))
0 0
1 1
2 2
3 3
s.iloc[[1,3]]
1 1
3 3
s.iloc[[1,3,5]]
IndexError
But I'd like to get back a DataFrame (or Series) having an index identical to the list I queried with (i.e., parallel to the query list), with (the rows corresponding to) any out-of-bounds indices filled in with NaN:
s.something[[1,3,5]]
1 1
3 3
5 NaN
I don't think join tricks work because those want to operate on the DataFrame index (or columns). As far as I can tell there's not even an "iget" integer-based get method if I wanted to manually loop over the indices myself. That leaves something like:
indices = [1,3,5]
pd.Series([s.iloc[i] if 0 <= i < len(s) else np.nan for i in indices], index=indices)
Is that the best Pandas 0.18 can do?
You can use reindex to achieve this:
In [119]:
s.reindex([1,3,5])
Out[119]:
1 1
3 3
5 NaN
dtype: float64
This will use the passed index and return the existing values, or NaN where a label is absent (note the dtype becomes float64, since NaN forces an upcast).
Thanks to @EdChum for the inspiration; the general solution is:
s.reset_index(drop=True).reindex([1,3,5])
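To see why the reset_index(drop=True) is needed in general, here is a quick sketch with a hypothetical non-default index; reindex alone matches labels, not positions:
s2 = pd.Series(range(4), index=[10, 11, 12, 13])
s2.reset_index(drop=True).reindex([1, 3, 5])
# 1    1.0
# 3    3.0
# 5    NaN
# dtype: float64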