Pandas DataFrame dynamic subgroup operation - python

I've got a "Perf" dataframe with people performance data over time.
The index is a timstamp and the columns are the persons name.
There are 100 persons (columns) and each person belongs to one of 10 groups, however the group assignment is dynamic, everyday each person could be assigned to a different group.
So there is a second "Group" DataFrame of the same shape than "Perf" that contains group number (0-9) for each timestamp and person.
The question is how can I elegantly do a mean subtraction everyday for each person with regards to its group assignment?
One method, which is really slow, is:
for g in range(10):
    Perf[Group == g] = Perf[Group == g].sub(Perf[Group == g].mean(axis=1), axis=0)
I'm sure there is a way to do it in one shot with Pandas.
Here is a concrete example.
perf holds the score for each of 5 persons (0-4) over 10 days (0-9):
In [8]: perf = DataFrame(np.random.randn(10,5))
In [9]: perf
Out[9]:
0 1 2 3 4
0 0.945575 -0.805883 1.338865 0.420829 -1.074329
1 -1.086116 0.430230 1.296153 0.527612 1.269646
2 0.705276 -1.409828 2.859838 -0.769508 1.520295
3 0.331860 -0.217884 0.962576 -0.495888 -1.083996
4 0.402625 0.018885 -0.260516 -0.547802 -0.995959
5 2.168944 -0.361657 0.184537 0.391014 0.972161
6 1.959699 0.590739 -0.781736 1.059761 1.080997
7 2.090273 -2.446399 0.553785 0.806368 -0.786343
8 0.441160 -2.320302 -1.981387 2.190607 0.345626
9 -0.276013 -1.319214 1.339096 0.269680 -0.509884
Then I've got a grouping DataFrame that, for each day, shows the group assignment of each of the 5 persons; the grouping changes every day.
In [20]: grouping
Out[20]:
0 1 2 3 4
0 3 1 2 1 2
1 3 1 2 2 1
2 2 2 3 1 1
3 1 2 2 3 1
4 3 2 1 2 1
5 2 1 1 2 3
6 1 2 1 2 3
7 2 2 1 1 3
8 2 1 2 1 3
9 1 3 2 1 2
I want to modify perf so that, for each day, I subtract from each person the mean score of its group on that day.
For example, for day 0 the result will be 0.0 -0.613356 1.206597 0.613356 -1.206597.
I want to do it in one shot without loops. groupby seems to be the function to use, but I couldn't use its output efficiently to perform the mean subtraction on the original matrix.
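One possible approach (a sketch added here, not an answer from the original thread) is to stack both frames into long format, group by (day, group), subtract the transformed group means, and unstack back to the original shape:
s = perf.stack()
g = grouping.stack()
# within each (day, group) pair, subtract that day's group mean
demeaned = (s - s.groupby([s.index.get_level_values(0), g]).transform('mean')).unstack()
On the example above this gives 0.0 -0.613356 1.206597 0.613356 -1.206597 for day 0, matching the expected result.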

Related

How to create a counter based on another column?

I've created this data frame:
import numpy as np
import pandas as pd

Range = np.arange(0, 9, 1)
A = {
    0: 2,
    1: 2,
    2: 2,
    3: 2,
    4: 3,
    5: 3,
    6: 3,
    7: 2,
    8: 2,
}
Table = pd.DataFrame({"Row": Range})
Table["Intervals"] = (Table["Row"] % 9).map(A)
Table
Row Intervals
0 0 2
1 1 2
2 2 2
3 3 2
4 4 3
5 5 3
6 6 3
7 7 2
8 8 2
I'd like to create another column, based on the Intervals column, that acts as a sort of counter - so the values will be 1,2,1,2,1,2,3,1,2.
The logic is that I want to count up to the value of the Intervals column.
I've tried to use groupby, but the issue is that the values are displayed multiple times.
Logic:
We have 2 different values - 2 and 3. Each value occurs in the Intervals column as many times as the value itself - so 2, for example, occurs twice (2,2), and 3 occurs 3 times (3,3,3).
For the first 4 rows, the value 2 appears twice - that is why the new column should be 1,2 (counter of the first 2) and then again 1,2 (counter of the second 2).
Afterwards, there is 3, so the values are 1,2,3.
And then once again 2, so the values are 1,2.
Hope I managed to explain myself.
Thanks in advance!
You can use groupby.cumcount combined with mod:
# identify runs of consecutive identical Intervals values
group = Table['Intervals'].ne(Table['Intervals'].shift()).cumsum()
# count within each run, wrap around at the Intervals value, then start at 1
Table['Counter'] = Table.groupby(group).cumcount().mod(Table['Intervals']).add(1)
Or:
group = Table['Intervals'].ne(Table['Intervals'].shift()).cumsum()
Table['Counter'] = (Table.groupby(group)['Intervals']
                         .transform(lambda s: np.arange(len(s)) % s.iloc[0] + 1)
                   )
Output:
Row Intervals Counter
0 0 2 1
1 1 2 2
2 2 2 1
3 3 2 2
4 4 3 1
5 5 3 2
6 6 3 3
7 7 2 1
8 8 2 2

Is there an efficient way to categorise rows of sequential increasing data into a group in a pandas data frame

I have a dataset that looks roughly like this (the first column being the index):
measurement value
0 1 0.617350
1 2 0.394176
2 3 0.775822
3 1 0.811693
4 2 0.621867
5 3 0.743718
6 4 0.183111
7 1 0.118586
8 2 0.274038
9 3 0.871772
The values in the measurement column are sequentially increasing measurement parameters: the test cycles through them, taking a reading at each step, before resetting and starting again from the beginning.
The challenge I face is that I need to label each cycle in an additional group column.
measurement value group
0 1 0.617350 1
1 2 0.394176 1
2 3 0.775822 1
3 1 0.811693 2
4 2 0.621867 2
5 3 0.743718 2
6 4 0.183111 2
7 1 0.118586 3
8 2 0.274038 3
9 3 0.871772 3
The only solution I can think of is to have two nested for loops: the first finding the start of each measurement condition, the second counting to the end of each measurement condition, then labelling that group. This doesn't seem very efficient though, so I wondered if there was a better way?
If each cycle starts with measurement 1, compare the values to 1 and take the cumulative sum:
df['group'] = df['measurement'].eq(1).cumsum()
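If the cycles do not always restart at exactly 1, a slightly more general sketch (an assumption on my part, not part of the original answer) starts a new group whenever the measurement value decreases:
df['group'] = df['measurement'].diff().lt(0).cumsum() + 1
On the sample data this also yields 1,1,1,2,2,2,2,3,3,3.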

Use drop duplicates in Pandas DF but choose keep column based on a preference list

I have a dataframe with many columns. There is a datetime column, and there are duplicated entries for the datetime, with the data for those duplicates coming from different sources. I would like to drop the duplicates based on column "dt", but I want to choose which row to keep based on what is in column "Pref". I have provided simplified data below, but the reason for this is that I also have a value column, and the "Pref" column is the data source. I prefer certain data sources, but I only need one entry per date (column "dt"). I would also like the code to work without having to provide a complete list of preferences.
Artificial Data Code
import pandas as pd
import numpy as np
df = pd.DataFrame({'dt': [1, 1, 1, 2, 2, 3, 3, 4, 4, 5],
                   "Pref": [1, 2, 3, 2, 3, 1, 3, 1, 2, 3],
                   "Value": np.random.normal(size=10),
                   "String_col": ['A'] * 10})
df
Out[1]:
dt Pref Value String_col
0 1 1 -0.479593 A
1 1 2 0.553963 A
2 1 3 0.194266 A
3 2 2 0.598814 A
4 2 3 -0.909138 A
5 3 1 -0.297539 A
6 3 3 -1.100855 A
7 4 1 0.747354 A
8 4 2 1.002964 A
9 5 3 0.301373 A
Desired Output 1 (CASE 1):
In this case my preference list matters all the way down: I prefer data source 2 the most, followed by 1, but will take 3 if that is all I have.
preference_list=[2,1,3]
Out[2]:
dt Pref Value String_col
1 1 2 0.553963 A
3 2 2 0.598814 A
5 3 1 -0.297539 A
8 4 2 1.002964 A
9 5 3 0.301373 A
Desired Output 2 (CASE 2):
In this case I just want to look for data source 1. If it is not present I don't actually care what the other data source is.
preference_list2=[1]
Out[3]:
dt Pref Value String_col
0 1 1 -0.479593 A
3 2 2 0.598814 A
5 3 1 -0.297539 A
7 4 1 0.747354 A
9 5 3 0.301373 A
I can imagine doing this in a really slow and complicated loop, but I feel like there should be a command to accomplish this. Another important thing: I need to keep some other text columns in the data frame, so .agg may cause issues for that metadata. I have experimented with sorting and with the keep argument of drop_duplicates, but with no success.
You are actually looking for sorting by an ordered category, which can be done with pd.Categorical:
# sources earlier in preference_list sort first, so drop_duplicates keeps them
df["Pref"] = pd.Categorical(df["Pref"], categories=preference_list, ordered=True)
print(df.sort_values(["dt", "Pref"]).drop_duplicates("dt"))
dt Pref Value String_col
1 1 2 -1.004362 A
3 2 2 -1.316961 A
5 3 1 0.513618 A
8 4 2 -1.859514 A
9 5 3 1.199374 A
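For CASE 2, where the preference list only covers some of the sources, one possible sketch (my addition, not from the answers above; the temporary rank column is a made-up name) maps each source to its position in the list and gives every unlisted source the same, lowest priority:
rank = {p: i for i, p in enumerate(preference_list2)}
out = (df.assign(rank=df['Pref'].map(rank).fillna(len(preference_list2)))
         .sort_values(['dt', 'rank'], kind='stable')
         .drop_duplicates('dt')
         .drop(columns='rank'))
With preference_list2 = [1] this keeps the rows with Pref 1 where available and the first remaining row otherwise, as in Desired Output 2.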
Here is a simple alternative that keeps only the rows whose source is in the preference list. I hope it helps!
import pandas as pd
import numpy as np

df = pd.DataFrame({'dt': [1, 1, 1, 2, 2, 3, 3, 4, 4, 5],
                   "Pref": [1, 2, 3, 2, 3, 1, 3, 1, 2, 3],
                   "Value": np.random.normal(size=10),
                   "String_col": ['A'] * 10})
preference_list = [2, 3]
# keep only the rows whose data source is in the preference list
df_clean = df[df['Pref'].isin(preference_list)]
print(df)
print(df_clean)
Output:
dt Pref Value String_col
0 1 1 1.404505 A
1 1 2 0.840923 A
2 1 3 -1.509667 A
3 2 2 -1.431240 A
4 2 3 -0.576142 A
5 3 1 -1.208514 A
6 3 3 -0.456773 A
7 4 1 0.574463 A
8 4 2 -1.682750 A
9 5 3 0.719394 A
dt Pref Value String_col
1 1 2 0.840923 A
2 1 3 -1.509667 A
3 2 2 -1.431240 A
4 2 3 -0.576142 A
6 3 3 -0.456773 A
8 4 2 -1.682750 A
9 5 3 0.719394 A

Grouping data with sequence counters in python

Opening a new request, since the previous one couldn't be answered completely. Please merge these if possible.
Sequence counters for group based data in python
I want to group the repeating sequences in column A. Column A can contain any numbers and is not restricted to 0, 1, 2. So basically, all rows between two 0's need to be assigned a value in B that increments with every such group. Here is sample data; A and B are fields in a DataFrame, and there are others too.
A B
0 1
1 1
2 1
0 2
1 2
2 2
0 3
1 3
2 3
1 3
2 3
0 4
1 4
2 4
2 4
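This question is left unanswered in the thread; a minimal sketch, assuming every 0 in column A marks the start of a new group, would be:
df['B'] = df['A'].eq(0).cumsum()
Each 0 starts a new run, so the running count of zeros reproduces the B column shown above.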

Is it possible to use vectorization for a conditionnal count of rows in a Pandas Dataframe?

I have a Pandas DataFrame with data about calls. Each call has a unique ID and each customer has an ID (but can have multiple calls). A third column gives the day. For each customer I want to calculate the maximum number of calls made in a period of 7 days.
I have been using the following code to count the number of calls within 7 days of the call on each row:
df['ContactsIN7Days'] = df.apply(lambda row: len(df[(df['PersonID']==row['PersonID']) & (abs(df['Day'] - row['Day']) <=7)]), axis=1)
Output:
CallID Day PersonID ContactsIN7Days
6 2 3 2
3 14 2 2
1 8 1 1
5 1 3 2
2 12 2 2
7 100 3 1
This works; however, it is going to be applied to a big data set. Is there a way to make this more efficient, perhaps through vectorization?
IIUC, this is a convoluted but, I think, effective solution to your issue. Note that the order of your dataframe is modified as a result, and that your Day column is converted to a timedelta dtype:
Starting from your dataframe df:
CallID Day PersonID
0 6 2 3
1 3 14 2
2 1 8 1
3 5 1 3
4 2 12 2
5 7 100 3
Start by modifying Day to a timedelta series:
df['Day'] = pd.to_timedelta(df['Day'], unit='d')
Then use pd.merge_asof to merge your dataframe with the count of calls by each individual in a period of 7 days. To get this, use groupby with a pd.Grouper with a frequency of 7 days:
new_df = (pd.merge_asof(df.sort_values(['Day']),
                        df.sort_values(['Day'])
                          .groupby([pd.Grouper(key='Day', freq='7d'), 'PersonID'])
                          .size()
                          .to_frame('ContactsIN7Days')
                          .reset_index(),
                        left_on='Day', right_on='Day',
                        left_by='PersonID', right_by='PersonID',
                        direction='nearest'))
Your resulting new_df will look like this:
CallID Day PersonID ContactsIN7Days
0 5 1 days 3 2
1 6 2 days 3 2
2 1 8 days 1 1
3 2 12 days 2 2
4 3 14 days 2 2
5 7 100 days 3 1
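As an alternative sketch (my addition, not part of the original answer; window_count is a made-up helper), the same +/- 7 day count can be computed per person on the original integer Day column (before the timedelta conversion above) with searchsorted, avoiding the row-wise apply:
import numpy as np

def window_count(days, width=7):
    # for each call, count that person's calls within +/- width days
    vals = np.sort(days.to_numpy())
    lo = np.searchsorted(vals, days.to_numpy() - width, side='left')
    hi = np.searchsorted(vals, days.to_numpy() + width, side='right')
    return hi - lo

df['ContactsIN7Days'] = df.groupby('PersonID')['Day'].transform(window_count)
On the sample data this reproduces the ContactsIN7Days column from the question.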
