Grouping data with sequence counters in Python

Opening a new request, since the previous one couldn't be answered completely. Please merge these if possible.
Sequence counters for group-based data in Python
I want to group the repeating sequences in column A. Column A can contain any numbers and is not restricted to 0, 1, 2. Essentially, all rows between two 0's need to be assigned a value in B that increments with every such group. Here is some sample data; A and B are columns in a DataFrame, and there are others too.
A B
0 1
1 1
2 1
0 2
1 2
2 2
0 3
1 3
2 3
1 3
2 3
0 4
1 4
2 4
2 4
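A minimal sketch of one approach, assuming each group starts on a row where A equals 0 (column names taken from the sample above): compare A to 0, then take the cumulative sum, so every row carries the id of the group it belongs to.
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2, 0, 1, 2, 0, 1, 2, 1, 2, 0, 1, 2, 2]})
# True on each row that opens a new group; cumsum yields 1, 1, 1, 2, 2, 2, ...
df['B'] = df['A'].eq(0).cumsum()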

Related

How to create a counter based on another column?

I've created this data frame:
import numpy as np
import pandas as pd

Range = np.arange(0, 9, 1)
A = {0: 2, 1: 2, 2: 2, 3: 2, 4: 3, 5: 3, 6: 3, 7: 2, 8: 2}
Table = pd.DataFrame({"Row": Range})
Table["Intervals"] = (Table["Row"] % 9).map(A)
Table
Row Intervals
0 0 2
1 1 2
2 2 2
3 3 2
4 4 3
5 5 3
6 6 3
7 7 2
8 8 2
I'd like to create another column, based on the Intervals column, that will act as a sort of counter - so the values will be 1,2,1,2,1,2,3,1,2.
The logic is that I want to count by the value of the Intervals column.
I've tried to use groupby, but the issue is that the values are displayed multiple times.
Logic:
We have 2 different values - 2 and 3. Each value occurs in the Intervals column as many times as the value itself - so 2, for example, occurs twice (2, 2), and 3 occurs three times (3, 3, 3).
For the first 4 rows, the value 2 is displayed twice; that is why the new column should be 1,2 (counting the first 2) and then again 1,2 (counting the second 2).
Afterwards there is 3, so the values are 1,2,3.
And then once again 2, so the values are 1,2.
Hope I managed to explain myself.
Thanks in advance!
You can use groupby.cumcount combined with mod:
group = Table['Intervals'].ne(Table['Intervals'].shift()).cumsum()
Table['Counter'] = Table.groupby(group).cumcount().mod(Table['Intervals']).add(1)
Or:
group = Table['Intervals'].ne(Table['Intervals'].shift()).cumsum()
Table['Counter'] = (Table.groupby(group)['Intervals']
                         .transform(lambda s: np.arange(len(s)) % s.iloc[0] + 1))
Output:
Row Intervals Counter
0 0 2 1
1 1 2 2
2 2 2 1
3 3 2 2
4 4 3 1
5 5 3 2
6 6 3 3
7 7 2 1
8 8 2 2
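To see why this works: the ne/shift/cumsum pattern gives every consecutive run of equal Intervals values its own id, cumcount numbers the rows within each run starting at 0, and mod wraps that count at the interval length. A quick look at the intermediates (using the Table built above):
group = Table['Intervals'].ne(Table['Intervals'].shift()).cumsum()
group.tolist()                            # [1, 1, 1, 1, 2, 2, 2, 3, 3] - one id per run
Table.groupby(group).cumcount().tolist()  # [0, 1, 2, 3, 0, 1, 2, 0, 1]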

Is there an efficient way to categorise rows of sequentially increasing data into groups in a pandas data frame

I have a dataset that looks roughly like this (the first column being the index):
measurement value
0 1 0.617350
1 2 0.394176
2 3 0.775822
3 1 0.811693
4 2 0.621867
5 3 0.743718
6 4 0.183111
7 1 0.118586
8 2 0.274038
9 3 0.871772
My values in the second column are sequentially increasing measurement parameters; the test cycles through these parameters, taking a reading at each step, before resetting and starting again from the beginning.
The challenge I face is that I need to label each cycle in a new group column.
measurement value group
0 1 0.617350 1
1 2 0.394176 1
2 3 0.775822 1
3 1 0.811693 2
4 2 0.621867 2
5 3 0.743718 2
6 4 0.183111 2
7 1 0.118586 3
8 2 0.274038 3
9 3 0.871772 3
The only solution I can think of is two nested for loops: the first finds the start of each measurement cycle, the second counts to the end of that cycle and labels the group. That doesn't seem very efficient though; I wondered if there was a better way?
If each measurement cycle starts with 1, compare the values to 1 and take the cumulative sum:
df['group'] = df['measurement'].eq(1).cumsum()
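A quick check against the sample data: the boolean from eq(1) is True at every cycle start, and the cumulative sum counts how many starts have been seen so far.
import pandas as pd

df = pd.DataFrame({'measurement': [1, 2, 3, 1, 2, 3, 4, 1, 2, 3]})
df['group'] = df['measurement'].eq(1).cumsum()
df['group'].tolist()  # [1, 1, 1, 2, 2, 2, 2, 3, 3, 3]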

Is there a function in pandas to help me count each string from each row list?

I have a dataframe like this:
df1
a b c
0 1 2 [bg10, ng45, fg56]
1 4 5 [cv10, fg56]
2 7 8 [bg10, ng45, fg56]
3 7 8 [fg56, fg56]
4 7 8 [bg10]
I would like to count the total occurrences of each item in column 'c'. I would then like to return the value of column 'b' for the items in column 'c' that have a total count of 1.
The expected output is something like this:
c b total_count
0 bg10 2 2
0 ng45 2 2
0 fg56 2 5
1 cv10 5 1
1 fg56 5 5
I have tried the 'collections' library and a 'for' loop (I understand it's not best practice in Pandas), but I think I'm missing some fundamental understanding of lists within cells and how to perform analyses like this.
Thank you for taking my question into consideration.
I would use apply in the following way.
First, create the df:
import pandas as pd

df1 = pd.DataFrame({"b": [2, 5, 8, 8],
                    "c": [['bg10', 'ng45', 'fg56'], ['cv10', 'fg56'],
                          ['bg10', 'ng45', 'fg56'], ['fg56', 'fg56']]})
Next, use apply to count the number of (non-unique) items in each list and save it in a new column:
df1["count_c"] = df1.c.apply(len)
you will get the following:
b c count_c
0 2 [bg10, ng45, fg56] 3
1 5 [cv10, fg56] 2
2 8 [bg10, ng45, fg56] 3
3 8 [fg56, fg56] 2
To get the rows where count_c is larger than a threshold:
df1[df1["count_c"] > 2]["b"]
Note: if you want to count only the unique values in each list in column c, you should use:
df1["count_c"] = df1.c.apply(lambda x: len(set(x)))
EDIT
In order to count the total number of occurrences of each item, I would try this.
First, let's "unpack" all the lists into rows:
new_df1 = (df1.c.apply(pd.Series)
              .stack()
              .reset_index(level=1, drop=True)
              .to_frame("c")
              .join(df1[["b"]], how="left"))
Then get the total count of each item and map it into a new column:
counts_dict = new_df1.c.value_counts().to_dict()
new_df1["total_count_c"] = new_df1.c.map(counts_dict)
new_df1.head()
c b total_count_c
0 bg10 2 2
0 ng45 2 2
0 fg56 2 5
1 cv10 5 1
1 fg56 5 5
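As an aside, not part of the original answer: on pandas 0.25+ the same unpacking can be done more concisely with DataFrame.explode, which turns each list element into its own row. A sketch of the equivalent computation:
exploded = df1.explode('c')[['c', 'b']]
# total occurrences of each item across all rows
exploded['total_count_c'] = exploded['c'].map(exploded['c'].value_counts())
# the b values for items that occur exactly once overall
exploded.loc[exploded['total_count_c'] == 1, 'b']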

Applying operations on groups without aggregating

I want to apply an operation to multiple groups of a data frame and then fill all values of each group with the result. Let's take mean and np.cumsum as examples, and the following dataframe:
import pandas as pd

df = pd.DataFrame({"a": [1, 3, 2, 4], "b": [1, 1, 2, 2]})
which looks like this
a b
0 1 1
1 3 1
2 2 2
3 4 2
Now I want to group the dataframe by b, then take the mean of a in each group, then apply np.cumsum to the means, and then replace all values of a by the (group dependent) result.
For the first three steps, I would start like this
df.groupby("b").mean().apply(np.cumsum)
which gives
a
b
1 2
2 5
But what I want to get is
a b
0 2 1
1 2 1
2 5 2
3 5 2
Any ideas how this can be solved in a nice way?
You can use map with a Series:
df1 = df.groupby("b").mean().cumsum()
print (df1)
a
b
1 2
2 5
df['a'] = df['b'].map(df1['a'])
print (df)
a b
0 2 1
1 2 1
2 5 2
3 5 2
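The same thing can be condensed into a single line, chaining the groupby, cumsum, and map steps:
df['a'] = df['b'].map(df.groupby('b')['a'].mean().cumsum())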

Pandas DataFrame dynamic subgroup operation

I've got a "Perf" dataframe with people's performance data over time.
The index is a timestamp and the columns are the persons' names.
There are 100 persons (columns) and each person belongs to one of 10 groups; however, the group assignment is dynamic, and each day a person could be assigned to a different group.
So there is a second "Group" DataFrame of the same shape as "Perf" that contains the group number (0-9) for each timestamp and person.
The question is: how can I elegantly do a mean subtraction every day for each person with regard to its group assignment?
One method is:
for g in range(10):
    Perf[Group == g] -= Perf[Group == g].mean(1)
But this is really slow; I'm sure there is a way to do it in one shot with Pandas.
Here is a concrete example.
perf represents the score of each of 5 persons (columns 0-4) over 10 days (rows 0-9):
In [8]: perf = DataFrame(np.random.randn(10,5))
In [9]: perf
Out[9]:
0 1 2 3 4
0 0.945575 -0.805883 1.338865 0.420829 -1.074329
1 -1.086116 0.430230 1.296153 0.527612 1.269646
2 0.705276 -1.409828 2.859838 -0.769508 1.520295
3 0.331860 -0.217884 0.962576 -0.495888 -1.083996
4 0.402625 0.018885 -0.260516 -0.547802 -0.995959
5 2.168944 -0.361657 0.184537 0.391014 0.972161
6 1.959699 0.590739 -0.781736 1.059761 1.080997
7 2.090273 -2.446399 0.553785 0.806368 -0.786343
8 0.441160 -2.320302 -1.981387 2.190607 0.345626
9 -0.276013 -1.319214 1.339096 0.269680 -0.509884
Then I've got a grouping dataframe that, for each day, shows the group assignment of each of the 5 persons; the grouping changes every day.
In [20]: grouping
Out[20]:
0 1 2 3 4
0 3 1 2 1 2
1 3 1 2 2 1
2 2 2 3 1 1
3 1 2 2 3 1
4 3 2 1 2 1
5 2 1 1 2 3
6 1 2 1 2 3
7 2 2 1 1 3
8 2 1 2 1 3
9 1 3 2 1 2
I want to modify perf such that, for each day, I subtract from each person the mean score of its group.
For example, for day 0 it will be: 0.0 -0.613356 1.206597 0.613356 -1.206597
I want to do it in one line without loops. groupby seems to be the function to use, but I couldn't efficiently use its output to perform the mean-subtraction operation on the original matrix.
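A minimal sketch of one loop-free approach, assuming perf and grouping as defined above: stack both frames into long form so that each (day, person) pair becomes one row, group the scores by day and group id, and subtract the group means broadcast back via transform.
scores = perf.stack()      # long form, MultiIndex (day, person)
groups = grouping.stack()  # group id aligned with every (day, person) pair
# per (day, group) mean, broadcast back to every group member
group_means = scores.groupby([scores.index.get_level_values(0), groups]).transform('mean')
result = (scores - group_means).unstack()  # back to the day x person layout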
