Groupby when given the start positional index of each group - python

I have one series of values that I would like to group, and another series containing the starting positional index of each group after the first (the first group is understood to begin at positional index 0). The series of values can have an arbitrary index. Is there a way to use this to produce a groupby-aggregate? Ideally empty groups would be preserved. Example:
values = pd.Series(np.arange(10, 20), index=np.arange(110, 120))
group_indices = pd.Series([3, 3, 8])
Now, the values should be grouped so that the first group is values.iloc[:3], the second is values.iloc[3:3] (an empty group), the third is values.iloc[3:8], and the fourth is values.iloc[8:], and the grouped mean would be pd.Series([11.0, NaN, 15.0, 18.5]).

Here is an easy way:
values.groupby(values.index.isin(group_indices).cumsum()).mean()
Out[454]:
1 11.0
2 15.0
3 18.5
dtype: float64
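To see what the grouping key looks like, here is a small sketch (an illustration only; it assumes values has a default RangeIndex and that the leading start 0 is included in group_indices, which is what the group labels 1, 2, 3 above reflect):
values = pd.Series(np.arange(10, 20))
group_indices = pd.Series([0, 3, 8])
key = values.index.isin(group_indices).cumsum()
# key -> array([1, 1, 1, 2, 2, 2, 2, 2, 3, 3])
values.groupby(key).mean()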

Straightforwardly, with the numpy.split routine:
In [1286]: values = pd.Series(np.arange(10, 20))
In [1287]: group_indices = pd.Series([0, 3, 8])
In [1288]: pd.Series([s.mean() for s in np.split(values, group_indices) if s.size])
Out[1288]:
0 11.0
1 15.0
2 18.5
dtype: float64
To account "empty" group - just remove if s.size check:
In [1304]: group_indices = pd.Series([3, 3, 8])
In [1305]: pd.Series([s.mean() for s in np.split(values, group_indices)])
Out[1305]:
0 11.0
1 NaN
2 15.0
3 18.5
dtype: float64

Given your update, here's an odd way to do this with pd.merge_asof. Some care needs to be taken to deal with the first group, which runs from position 0 up to the first index in the Series.
import pandas as pd
import numpy as np
(pd.merge_asof(values.to_frame('val'),
               values.iloc[np.r_[group_indices]].reset_index().reset_index().drop(columns=0),
               left_index=True, right_on='index',
               direction='backward')
   .fillna({'level_0': -1})  # Because your first group is 0: first index
   .groupby('level_0').val.mean()
   .reindex([-1] + [*range(len(group_indices))])  # Get 0-size groups in output
)
level_0
-1 11.0
0 NaN
1 15.0
2 18.5
Name: val, dtype: float64
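For clarity, the right-hand lookup frame built from group_indices (a sketch of the intermediate result, using the question's values with index 110-119 and group_indices = [3, 3, 8]) looks like this:
values.iloc[np.r_[group_indices]].reset_index().reset_index().drop(columns=0)
   level_0  index
0        0    113
1        1    113
2        2    118
merge_asof then matches each row of values backward to the last start position at or before it; rows before the first start get NaN for level_0 (hence the fillna to -1), and because the duplicate start keeps only the last match, the empty group 0 only reappears through the final reindex.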

Let's change group_indices a bit, so that the group names (1, 2, 3) are visible,
group_indices = pd.Series([1,2,3],index=[0, 3, 8])
then
values.groupby(group_indices.reindex(values.index,method='ffill')).mean()
would give you what you want.
Note that group_indices.reindex(values.index,method='ffill') gives you
0 1
1 1
2 1
3 2
4 2
5 2
6 2
7 2
8 3
9 3
which assigns a group number to each row of values.
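If you start from positional start indices (as in the question) rather than a pre-labelled series, one way to build such a label series is sketched below; this assumes the leading 0 is included and the starts are unique (i.e. no empty groups):
starts = [0, 3, 8]  # positional start of each group, leading 0 included
labels = pd.Series(range(1, len(starts) + 1), index=values.index[starts])
values.groupby(labels.reindex(values.index, method='ffill')).mean()
This also works with the arbitrary index from the question (110-119), because the labels are anchored to the actual index values at the start positions.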

My solution involves keeping the inputs as they are and doing some ugly adjustments:
pd.DataFrame(values).assign(
    group=pd.cut(pd.DataFrame(values).index, [-1, 2, 7, np.inf], labels=[0, 1, 2])
).groupby('group').mean()
Output
          0
group
0      11.0
1      15.0
2      18.5
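The bin edges can also be derived from group_indices instead of being hardcoded; a sketch under the same assumptions as this answer (a default RangeIndex of positions 0-9, empty groups dropped):
edges = [-1] + sorted(set(g - 1 for g in group_indices)) + [np.inf]   # [-1, 2, 7, inf]
labels = list(range(len(edges) - 1))                                  # [0, 1, 2]
pd.DataFrame(values).assign(
    group=pd.cut(pd.DataFrame(values).index, edges, labels=labels)
).groupby('group').mean()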

Thanks to all the answers, especially WeNYoBen's. The following will produce the correct groups and skip over empty groups.
# First, add the final index to `group_indices` so that
# we have a series of right endpoints, or interval upper bounds
upper_bounds = pd.concat([group_indices, pd.Series([values.shape[0]])], ignore_index=True)  # Series.append was removed in pandas 2.0
# Compute indices of nonempty groups
lower_bounds = upper_bounds.shift(fill_value=0)
nonempty_group_idxs = upper_bounds != lower_bounds
# Get means indexed from 0 to n_nonempty_groups-1
means = values.groupby(pd.RangeIndex(values.shape[0]).isin(upper_bounds).cumsum()).mean()
# Reassign index for the correct groups
means.index = nonempty_group_idxs.index[nonempty_group_idxs]
This will have a noncontinuous index, with skipped elements corresponding to empty groups in the original groupby. If you want to place NaN in those spots, you can do
means = means.reindex(index=pd.RangeIndex(group_indices.shape[0]))
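For reference, a compact variant of the same idea using np.searchsorted (a sketch, not one of the answers above); it works with the arbitrary index of values and preserves empty groups, because a repeated start position simply produces a label that never appears:
labels = np.searchsorted(group_indices, np.arange(len(values)), side='right')
# labels -> array([0, 0, 0, 2, 2, 2, 2, 2, 3, 3])
means = values.groupby(labels).mean().reindex(range(len(group_indices) + 1))
# 0    11.0
# 1     NaN
# 2    15.0
# 3    18.5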

Related

Manipulate string values in pandas

I have a pandas dataframe with different formats for one column like this
Name      Values
First     5-9
Second    7
Third     -
Fourth    12-16
I need to iterate over the Values column and, if the format is like the first row (5-9) or the fourth row (12-16), replace the value with the mean of the two numbers in the string.
So for the first row 5-9 becomes 7, and for the fourth row 12-16 becomes 14.
And if the format is like the third row (just -), replace it with 0.
I have tried
if df["Value"].str.len() > 1:
df["Value"] = df["Value"].str.split('-')
df["Value"] = (df["Value"][0] + df["Value"][1]) / 2
elif df["Value"].str.len() == 1:
df["Value"] = df["Value"].str.replace('-', 0)
Expected output
Name      Values
First     7
Second    7
Third     0
Fourth    14
Let us split and expand the column, then cast the values to float and calculate the mean along the column axis:
s = df['Values'].str.split('-', expand=True)
df['Values'] = s[s != ''].astype(float).mean(1).fillna(0)
Name Values
0 First 7.0
1 Second 7.0
2 Third 0.0
3 Fourth 14.0
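To see why this works, here is a rough sketch of the intermediate frame s; the empty strings produced by the lone '-' row are masked to NaN by s[s != ''], so that row's mean is NaN and fillna(0) turns it into 0:
s = df['Values'].str.split('-', expand=True)
# row 0: '5', '9'    -> mean of [5.0, 9.0]   -> 7.0
# row 1: '7', None   -> mean of [7.0, NaN]   -> 7.0
# row 2: '',  ''     -> all NaN after mask   -> fillna(0) -> 0.0
# row 3: '12', '16'  -> mean of [12.0, 16.0] -> 14.0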
You can use str.replace with a customized replacement function:
mint = lambda s: int(s or 0)
repl = lambda m: str(sum(map(mint, map(m.group, [1,2])))/2)
df['Values'] = df['Values'].str.replace(r'(\d*)-(\d*)', repl, regex=True)
print(df)
Name Values
0 First 7.0
1 Second 7
2 Third 0.0
3 Fourth 14.0

Finding out if values in dataframe increases in tens place

I'm trying to figure out if the value in my dataframe is increasing in the tens/hundreds place. For example, I created a dataframe with a few values, duplicated the values and shifted them, and now I'm able to compare them. But how do I find out whether the tens place is increasing, or whether the value only increases by a little, for example 0.02 points?
import pandas as pd
import numpy as np
data = {'value':['9','10','19','22','31']}
df = pd.DataFrame(data)
df['value_copy'] = df['value'].shift(1)
df['Increase'] = np.where(df['value']<df['value_copy'],1,0)
The output in this case should be:
[nan,1,0,1,1]
IIUC, divide by 10, get the floor, then compare the successive values (diff(1)) to see if the difference is exactly 1:
np.floor(df['value'].astype(float).div(10)).diff(1).eq(1).astype(int)
If you want a jump to at least the next tens (or more) use ge (≥):
np.floor(df['value'].astype(float).div(10)).diff(1).ge(1).astype(int)
output:
0 0
1 1
2 0
3 1
4 1
Name: value, dtype: int64
NB. if you insist on the NaN:
s = np.floor(df['value'].astype(float).div(10)).diff(1)
s.eq(1).astype(int).mask(s.isna())
output:
0 NaN
1 1.0
2 0.0
3 1.0
4 1.0
Name: value, dtype: float64
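For reference, the floored "tens" values behind both variants look like this on the example data (a sketch):
tens = np.floor(df['value'].astype(float).div(10))
# 0    0.0   (from 9)
# 1    1.0   (from 10)
# 2    1.0   (from 19)
# 3    2.0   (from 22)
# 4    3.0   (from 31)
# tens.diff(1) -> NaN, 1.0, 0.0, 1.0, 1.0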

Create a new column in a dataframe and add 1 to the previous row of that column

I am looking to derive a new column from the current row in my dataframe, adding 1 to the previous row's value to keep a kind of running total
df['Touch_No'] = np.where((df.Time_btween_steps.isnull()) | (df.Time_btween_steps > 30), 1, df.First_touch.shift().add(1))
I basically want to check if the column value is null: if it is, treat it as the first activity and reset the counter to 1; if not, add 1 to the previous value, to give me a running total of the number of outreaches we are doing on specific people:
Expected outcome:
Time Between Steps | Touch_No
Null               | 1
0                  | 2
5.4                | 3
6.7                | 4
2                  | 5
null               | 1
1                  | 2
Answer using this: a combo of cumsum(), groupby(), and cumcount().
df = pd.DataFrame(data=[None, 0, 5.4, 6.7, 2, None, 1], columns=['Time_btween_steps'])
df['Touch_No'] = np.where((df.Time_btween_steps.isnull()), (df.Time_btween_steps > 30), 1)
df['consec'] = df['Touch_No'].groupby((df['Touch_No']==0).cumsum()).cumcount()
df.head(10)
Edited according to your clarification:
df = pd.DataFrame(data=np.array(([None, 0, 5.4, 6.7, 2, None, 1],[50,1,2,3,4,35,1])).T, columns=['Time_btween_steps', 'Touch_No'])
mask = pd.isna(df['Time_btween_steps']) | (df['Time_btween_steps'] > 30)
df['Touch_No'][~mask] += 1
df['Touch_No'][mask] = 1
Returns:
Time_btween_steps Touch_No
0 None 1
1 0 2
2 5.4 3
3 6.7 4
4 2 5
5 None 1
6 1 2
In my opinion a solution like this is much more readable. We increment by 1 where the condition is not met, and we set the ones where the condition is true to 1. You can combine these into a single line if you wish.
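Note that df['Touch_No'][mask] = ... is chained indexing, which can raise SettingWithCopyWarning and may silently operate on a copy in newer pandas; the same two steps written with .loc (a sketch):
df.loc[~mask, 'Touch_No'] += 1
df.loc[mask, 'Touch_No'] = 1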
Old answer for posterity.
Here is a simple solution using pandas apply functionality which takes a function.
import pandas as pd
df = pd.DataFrame(data=[1,2,3,4,None,5,0],columns=['test'])
df.test.apply(lambda x: 0 if pd.isna(x) else x+1)
Which returns:
0 2.0
1 3.0
2 4.0
3 5.0
4 0.0
5 6.0
6 1.0
Here I wrote the function in place, but if you have more complicated logic, such as resetting if the number is something else, you can write a custom function and pass it in instead of the lambda. This is not the only way to do it, but if your data frame isn't huge (hundreds of thousands of rows), it should be performant. If you don't want a copy but want to overwrite the column, simply assign the result back by prepending df['test'] = to the last line.
If you want the output to be ints, you can also call df['test'].astype(int), but be careful about converting None/Null to int.
Using np.where, index values with ffill for partitioning and simple rank:
import numpy as np
import pandas as pd
sodf = pd.DataFrame({'time_bw_steps': [None, 0, 5.4, 6.7, 2, None, 1]})
sodf['touch_partition'] = np.where(sodf.time_bw_steps.isna(), sodf.index, np.NaN)
sodf['touch_partition'] = sodf['touch_partition'].fillna(method='ffill')
sodf['touch_no'] = sodf.groupby('touch_partition')['touch_partition'].rank(method='first', ascending='False')
sodf.drop(columns=['touch_partition'], axis='columns', inplace=True)
sodf

How to filter Pandas rows based on last/next row?

I have two data sets from different pulse oximeters and plot them with pyplot. As you may see, the green data set has a lot of outliers (vertical drops). In my work I've defined these outliers as non-valid for my statistical analysis; they are most certainly not real measurements. Therefore I argue that I can simply remove them.
The characteristic of these rogue values is that they're single-value (or at most two-value) outliers (see df below). The "real" sample values are either the same as the previous value, or +-1. In e.g. Java (pseudocode) I would do something like:
for (i; i < df.length; i++)
    if (df[i+1|-1].spo2 - df[i].spo2 > 1|-1)
        df[i].drop
What would be the pandas (numpy?) equivalent of what I'm trying to do: remove values that differ by more than 1 from the previous/next value?
df:
time, spo2
1900-01-01 18:18:41.194 98.0
1900-01-01 18:18:41.376 98.0
1900-01-01 18:18:41.559 78.0
1900-01-01 18:18:41.741 98.0
1900-01-01 18:18:41.923 98.0
1900-01-01 18:18:42.105 90.0
1900-01-01 18:18:42.288 97.0
1900-01-01 18:18:42.470 97.0
1900-01-01 18:18:42.652 98.0
Have a look at pandas.DataFrame.shift. It shifts the values in a column up or down by a given number of rows, so that each row can sit next to the value from a neighbouring row:
# original df
x1
0 0
1 1
2 2
3 3
4 4
# shift down
df.x2 = df.x1.shift(1)
x1 x2
0 0 NaN # Beware
1 1 0
2 2 1
3 3 2
4 4 3
# Shift up
df.x2 = df.x1.shift(-1)
x1 x2
0 0 1
1 1 2
2 2 3
3 3 4
4 4 NaN # Beware
You can use this to move spo2 of timestamp n+1 next to spo2 in the timestamp n row. Then, filter based on conditions applied to that one row.
df['spo2_Next'] = df['spo2'].shift(-1)
# replace the NaN in the last row to allow float comparison (use the row's own spo2 so it is not flagged)
df.spo2_Next.fillna(df.spo2, inplace = True)
# Apply your row-wise condition to create filter column
df.loc[((df.spo2_Next - df.spo2) > 1) | ((df.spo2_Next - df.spo2) < -1), 'Outlier'] = True
# filter
df_clean = df[df.Outlier != True]
# remove filter column
del df_clean['Outlier']
When you filter a pandas dataframe like:
df[(df.column1 == 2) & (df.column2 < 3)], you are:
comparing a numeric series to a scalar value and generating a boolean series
obtaining two boolean series and doing a logical and
then using that boolean series to filter the data frame (the rows where it is false will not be included in the new data frame)
So you just need to create an iterative algorithm over the data frame to produce such a boolean array, and use it to filter the dataframe, as in:
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
df[ [True, False, True]]
You can also create a closure to filter the data frame (using df.apply), keeping previous observations in the closure to detect abrupt changes, but this would be way too complicated. I would go for the straightforward imperative solution.
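For this particular spo2 example, here is a minimal vectorized sketch of that boolean-mask idea (column name taken from the question); it keeps a row if it is within 1 of at least one of its neighbours:
near_prev = df['spo2'].diff().abs() <= 1     # compare to previous row (NaN for the first row)
near_next = df['spo2'].diff(-1).abs() <= 1   # compare to next row (NaN for the last row)
df_clean = df[near_prev | near_next]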

How to get columns index which meet some condition in pandas?

I have the following:
x = pd.DataFrame({'a':[1,5,5], 'b':[7,0,7]})
And for every row, I want to get the index of the first column that meets the condition that its value is greater than some value, let's say greater than 4.
In this example, the answer is 1 (corresponding to the index of the value 7 in the first row), 0 (corresponding to the index of the value 5 in the second row), and 0 (corresponding to the index of the value 5 in the third row).
Which means the answer is [1, 0, 0].
I tried it with the apply method:
def get_values_from_row(row, th=0.9):
    """Get the first column whose value is larger than a threshold.

    Args:
        row (pd.Series): a row.
        th (float): the threshold.

    Returns:
        The label of the first column whose value meets the condition.
    """
    return row[row > th].index.tolist()[0]
It works, but I have a large data set and it's quite slow.
What is a better alternative?
I think you need first_valid_index with get_loc:
print (x[x > 4])
a b
0 NaN 7.0
1 5.0 NaN
2 5.0 7.0
print (x[x > 4].apply(lambda x: x.index.get_loc(x.first_valid_index()), axis=1))
0 1
1 0
2 0
dtype: int64
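Since apply over rows is slow on a large frame, here is a vectorized sketch (with the caveat that argmax also returns 0 for a row in which no value exceeds the threshold):
(x > 4).to_numpy().argmax(axis=1)
# array([1, 0, 0])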
