Finding consecutive segments in a pandas data frame - python

I have a pandas.DataFrame with measurements taken at consecutive points in time. At each point in time, the system under observation had a distinct state, so the DataFrame also contains a column with the state of the system at each measurement. State changes are much slower than the measurement interval. As a result, the column indicating the states might look like this (index: state):
1: 3
2: 3
3: 3
4: 3
5: 4
6: 4
7: 4
8: 4
9: 1
10: 1
11: 1
12: 1
13: 1
Is there an easy way to retrieve the indices of each segment of consecutively equal states? That means I would like to get something like this:
[[1,2,3,4], [5,6,7,8], [9,10,11,12,13]]
The result could also be in a format other than plain lists.
The only solution I have come up with so far is to iterate over the rows manually, find the segment change points, and reconstruct the indices from those change points, but I hope there is an easier solution.

One-liner:
df.reset_index().groupby('A')['index'].apply(np.array)
Code for example:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: df = pd.DataFrame([3]*4 + [4]*4 + [1]*4, columns=['A'])
In [4]: df
Out[4]:
    A
0   3
1   3
2   3
3   3
4   4
5   4
6   4
7   4
8   1
9   1
10  1
11  1
In [5]: df.reset_index().groupby('A')['index'].apply(np.array)
Out[5]:
A
1     [8, 9, 10, 11]
3       [0, 1, 2, 3]
4       [4, 5, 6, 7]
You can also directly access the information from the groupby object:
In [1]: grp = df.groupby('A')
In [2]: grp.indices
Out[2]:
{1: array([ 8,  9, 10, 11], dtype=int64),
 3: array([0, 1, 2, 3], dtype=int64),
 4: array([4, 5, 6, 7], dtype=int64)}
In [3]: grp.indices[3]
Out[3]: array([0, 1, 2, 3], dtype=int64)
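Note that grp.indices maps each group key to positional row numbers; they coincide with the index labels here only because the frame has the default RangeIndex.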
To address the situation that DSM mentioned (the same state value recurring in a later, non-adjacent segment, which a plain groupby('A') would merge into one group), you could do something like:
In [1]: df['block'] = (df.A.shift(1) != df.A).astype(int).cumsum()
In [2]: df
Out[2]:
    A  block
0   3      1
1   3      1
2   3      1
3   3      1
4   4      2
5   4      2
6   4      2
7   4      2
8   1      3
9   1      3
10  1      3
11  1      3
12  3      4
13  3      4
14  3      4
15  3      4
Now group by both columns and apply np.array as before:
In [77]: df.reset_index().groupby(['A','block'])['index'].apply(np.array)
Out[77]:
A  block
1  3         [8, 9, 10, 11]
3  1           [0, 1, 2, 3]
   4       [12, 13, 14, 15]
4  2           [4, 5, 6, 7]
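If you only want the runs themselves, in order of appearance, a minimal sketch (reusing the block column from above) is to group by block alone with sort=False:

segments = (df.reset_index()
              .groupby('block', sort=False)['index']
              .apply(np.array)
              .tolist())
# [array([0, 1, 2, 3]), array([4, 5, 6, 7]),
#  array([ 8,  9, 10, 11]), array([12, 13, 14, 15])]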

You could use np.diff() to test where a segment starts/ends and iterate over those results. It's a very simple solution, so probably not the most performant one.
import numpy as np

a = np.array([3,3,3,3,3,4,4,4,4,4,1,1,1,1,4,4,12,12,12])
# one split point just past each position where the value changes, plus the end
splits = np.append(np.where(np.diff(a) != 0)[0] + 1, a.size)
prev = 0
for split in splits:
    print(np.arange(1, a.size + 1)[prev:split])
    prev = split
Results in:
[1 2 3 4 5]
[ 6 7 8 9 10]
[11 12 13 14]
[15 16]
[17 18 19]
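The same change points can be fed straight to np.split, which removes the manual bookkeeping. A sketch of that variant:

import numpy as np

a = np.array([3,3,3,3,3,4,4,4,4,4,1,1,1,1,4,4,12,12,12])
segments = np.split(np.arange(1, a.size + 1), np.where(np.diff(a) != 0)[0] + 1)
for seg in segments:
    print(seg)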

Related

Pandas dataframe with N columns

I need to use Python with Pandas to write a DataFrame with N columns. This is a simplified version of what I have:
Ind=[[1, 2, 3],[4, 5, 6],[7, 8, 9],[10, 11, 12]]
DAT = pd.DataFrame([Ind[0],Ind[1],Ind[2],Ind[3]], index=None).T
DAT.head()
Out:
   0  1  2   3
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
This is the result that I want, but my real Ind has 121 sets of points, and I really don't want to write out each one in the DataFrame's argument. Is there a way to do this easily? I tried using a for loop, but that didn't work out.
You can just pass the list directly:
data = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
df = pd.DataFrame(data, index=None).T
df.head()
Outputs:
   0  1  2   3
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
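For the real case you can also build the nested list programmatically. A quick sketch, with hypothetical data standing in for the 121 sets:

import pandas as pd

ind = [[i, i + 1, i + 2] for i in range(0, 363, 3)]  # 121 hypothetical sets of 3 points
df = pd.DataFrame(ind).T
df.shape  # (3, 121) -- one column per set, as in the small example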

Python use dataframe column value in iloc (or shift)

Although my previous question was answered here: Python dataframe new column with value based on value in other row, I still want to know how to use a column value in iloc (or shift or rolling, etc.).
I have a dataframe with two columns, A and B. How do I use the value of column B in iloc? Or in shift()?
d = {'A': [8, 2, 4, 5, 6, 4, 3, 5, 5, 3], 'B': [2, -1, 4, 5, 0, -3, 8, 2, 6, -1]}
df = pd.DataFrame(data=d)
Using iloc I get this error:
df['C'] = df['A'] * df['A'].iloc[df['B']]
ValueError: cannot reindex from a duplicate axis
Using shift() I get another one:
df['C'] = df['A'] * df['A'].shift(df['B'])
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Is what I want to do possible? If yes, how? If not, why not?
Use numpy indexing:
print(df['A'].to_numpy()[df['B'].to_numpy()])
[4 3 6 4 8 5 5 4 3 3]
df['C'] = df['A'] * df['A'].to_numpy()[df['B'].to_numpy()]
print(df)
   A  B   C
0  8  2  32
1  2 -1   6
2  4  4  24
3  5  5  20
4  6  0  48
5  4 -3  20
6  3  8  15
7  5  2  20
8  5  6  15
9  3 -1   9
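One detail both answers here rely on: NumPy fancy indexing (and iloc with integers) treats negative values Python-style, counting from the end, which is why B = -1 picks up the last element of A. A minimal check:

import numpy as np

a = np.array([8, 2, 4, 5, 6, 4, 3, 5, 5, 3])
print(a[np.array([2, -1])])  # [4 3] -> index 2 from the front, -1 from the back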
NumPy indexing is the fastest way, I agree, but you can use a list comprehension + iloc too:
d = {'A': [8, 2, 4, 5, 6, 4, 3, 5, 5, 3], 'B': [2, -1, 4, 5, 0, -3, 8, 2, 6, -1]}
df = pd.DataFrame(data=d)
df['C'] = df['A'] * [df['A'].iloc[i] for i in df['B']]
   A  B   C
0  8  2  32
1  2 -1   6
2  4  4  24
3  5  5  20
4  6  0  48
5  4 -3  20
6  3  8  15
7  5  2  20
8  5  6  15
9  3 -1   9

Extract data from a dataframe

I have a list based upon which I want to retrieve data from a dataset.
Here is the list:
packed = [1, 5, 8, 2, 3, 3, 7, 3, 7, 7, 4, 6, 3]
and here is the dataset
There are two items with multiple quantities, i.e. 3 and 7.
I want to extract the rows that are not covered by the packed list. In this case that is two of the 7s (the other three are already in the list).
How can I do that? I tried this, but it doesn't work:
new_df= data[~data["Pid"].isin(packed)].reset_index(drop=True)
Use GroupBy.cumcount to build a helper DataFrame, merge with a left join and indicator=True, and finally filter by boolean indexing:
packed = [1, 5, 8, 2, 3, 3, 7, 3, 7, 7, 4, 6, 3]
df1 = pd.DataFrame({'Pid':packed})
df1['g'] = df1.groupby('Pid').cumcount()
print (df1)
    Pid  g
0     1  0
1     5  0
2     8  0
3     2  0
4     3  0
5     3  1
6     7  0
7     3  2
8     7  1
9     7  2
10    4  0
11    6  0
12    3  3
data['g'] = data.groupby('Pid').cumcount()
new_df = data[data.merge(df1, indicator=True, how='left')['_merge'].eq('left_only')]
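Since the dataset itself was only shown as an image, here is a self-contained sketch with hypothetical stand-in data (five 7s, four 3s) to show the whole pipeline:

import pandas as pd

packed = [1, 5, 8, 2, 3, 3, 7, 3, 7, 7, 4, 6, 3]
# hypothetical dataset, standing in for the image in the question
data = pd.DataFrame({'Pid': [1, 5, 8, 2, 3, 3, 3, 3, 7, 7, 7, 7, 7, 4, 6]})

df1 = pd.DataFrame({'Pid': packed})
df1['g'] = df1.groupby('Pid').cumcount()
data['g'] = data.groupby('Pid').cumcount()

new_df = data[data.merge(df1, indicator=True, how='left')['_merge'].eq('left_only')]
print(new_df)  # the two 7s with no counterpart in packed (g = 3 and g = 4)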

How to select ranges of values in pandas?

Newbie question.
My dataframe looks like this:
    class         A       B
0       1  3.767809  11.016
1       1  2.808231   4.500
2       1  4.822522   1.008
3       2  5.016933  -3.636
4       2  6.036203  -5.220
5       2  7.234567  -6.696
6       2  5.855065  -7.272
7       4  4.116770  -8.208
8       4  2.628000 -10.296
9       4  1.539184 -10.728
10      3  0.875918 -10.116
11      3  0.569210  -9.072
12      3  0.676379  -7.632
13      3  0.933921  -5.436
14      3  0.113842  -3.276
15      3  0.367129  -2.196
16      1  0.968661  -1.980
17      1  0.160997  -2.736
18      1  0.469383  -2.232
19      1  0.410463  -2.340
20      1  0.660872  -2.484
I would like to get groups where class is the same, like:
class 1: rows 0..2
class 2: rows 3..6
class 4: rows 7..9
class 3: rows 10..15
class 1: rows 16..20
The reason is that order matters. My requirements say that class 4 can only occur between 1 and 2, and if after prediction we get class 4 directly after 2, it should be treated as 2.
Build a new helper column to identify each group of consecutive rows:
df['group']=df['class'].diff().ne(0).cumsum()
df.groupby('group')['group'].apply(lambda x : x.index)
Out[106]:
group
1 Int64Index([0, 1, 2], dtype='int64')
2 Int64Index([3, 4, 5, 6], dtype='int64')
3 Int64Index([7, 8, 9], dtype='int64')
4 Int64Index([10, 11, 12, 13, 14, 15], dtype='in...
5 Int64Index([16, 17, 18, 19, 20], dtype='int64')
Name: group, dtype: object
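Building on that, here is a sketch of one reading of the stated requirement (relabel a class-4 run as 2 whenever it directly follows a class-2 run; the exact rule is my assumption about the intent):

prev = None
for _, g in df.groupby(df['class'].diff().ne(0).cumsum()):
    cur = g['class'].iat[0]
    if prev == 2 and cur == 4:
        df.loc[g.index, 'class'] = 2  # fold the run into the preceding class 2
        cur = 2
    prev = cur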

Choosing particular values from a Pandas Series

I have the Pandas Series s, part of which can be seen below. I basically want to insert the indices of those values of s which are not 0 into a list l, but I don't know how to do this.
2003-05-13    1
2003-11-02    0
2004-05-01    3
In [7] below is what you're looking for:
In [5]: s = pd.Series(np.random.choice([0, 1, 2], 10))
In [6]: print(s)
0    0
1    1
2    0
3    1
4    0
5    2
6    1
7    1
8    2
9    2
dtype: int64
In [7]: print(list(s.index[s != 0]))
[1, 3, 5, 6, 7, 8, 9]
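Two equivalent spellings of the same idea, for reference:

l = list(s.index[s != 0])      # mask the index directly, as above
l = s[s.ne(0)].index.tolist()  # filter the Series first, then take its index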
