Get weighting of row per group - python

I want to get the percentage/weighting of a row per group. An example of the dataframe is seen below.
Place District Count
A 1 12
B 1 13
C 1 34
D 2 56
E 2 1
F 3 23
I need to group by District but get a percentage or weighting of the Count for each Place row. For example, the calculation for Place A would be 12/(12+13+34) and for B it would be 13/(12+13+34).
The expected outcome would be:
Place District Count Weighting
A 1 12 0.203389831
B 1 13 0.220338983
C 1 34 0.576271186
D 2 56 0.98245614
E 2 1 0.01754386
F 3 23 1
I am using pandas dataframes.

IIUC, use GroupBy.transform:
df['Weighting'] = df['Count'].div(df.groupby('District')['Count'].transform('sum'))
Output
Place District Count Weighting
0 A 1 12 0.203390
1 B 1 13 0.220339
2 C 1 34 0.576271
3 D 2 56 0.982456
4 E 2 1 0.017544
5 F 3 23 1.000000
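For reference, a minimal self-contained sketch that rebuilds the example frame from the question and applies the same transform:

import pandas as pd

df = pd.DataFrame({
    'Place': list('ABCDEF'),
    'District': [1, 1, 1, 2, 2, 3],
    'Count': [12, 13, 34, 56, 1, 23],
})

# Divide each Count by the total Count of its District
df['Weighting'] = df['Count'].div(df.groupby('District')['Count'].transform('sum'))
print(df)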

How to delete groups based on conditions/values of another column

In my dataframe I want to delete those groups of column B in which all values in column C are smaller than 3.
So only those groups should be left that have at least one value in column C of 3 or more.
B   C
11  1
22  2
11  2
22  4
22  1
33  2
33  1
22  4
So in my example only group 22 should stay.
Probably something like this pseudo code:
df_clean = df.groupby('B')['C']< 3.0
How do I code an algorithm that can do this?
Maybe by creating df_count, counting the number of elements with a C-value greater than 2:
df_count = df.groupby(['B'])['C'].apply(lambda x: (x>2).sum()).reset_index(name='count')
B count
0 11 0
1 22 2
2 33 0
and then dropping those with a count of 0:
df = df[df['B'].isin(df_count[df_count['count'] > 0]['B'].unique())].sort_index()
B C
1 22 2
3 22 4
4 22 1
7 22 4
As I understand it, each group must have all its values less than 3 to be considered.
I'd start by getting the groups that satisfy this condition by comparing each group's maximum value with the target, 3:
>>> groups = [group for group in df['B'].unique() if max(df[df.B == group].C.values) < 3]
>>> groups
[11, 33]
Then you can slice your dataframe and get a new one with only the desired groups:
>>> df[df.B.isin(groups)]
    B  C
0  11  1
2  11  2
5  33  2
6  33  1
You can use groupby and any with a for loop to get your desired output:
for i, j in df.groupby('B'):
    if (j['C'] >= 3).any():
        result = j  # only one group (22) qualifies here, so overwriting is fine
B C
1 22 2
3 22 4
4 22 1
7 22 4
Or the other way round, if you are looking for groups with all values less than 3:
result = []
for i, j in df.groupby('B'):
    if (j['C'] < 3).all():
        result.append(j)
[ B C
0 11 1
2 11 2,
B C
5 33 2
6 33 1]
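A more compact alternative, assuming the goal is to keep the groups that have at least one value of 3 or more, is GroupBy.filter; a minimal sketch rebuilding the question's frame:

import pandas as pd

df = pd.DataFrame({'B': [11, 22, 11, 22, 22, 33, 33, 22],
                   'C': [1, 2, 2, 4, 1, 2, 1, 4]})

# Keep only groups of B where any C reaches 3
df_clean = df.groupby('B').filter(lambda g: (g['C'] >= 3).any())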

Python dataframe rank each column based on row values

I have a dataframe. I want to rank the columns within each row based on their values.
Ex:
xdf = pd.DataFrame({'A':[10,20,30],'B':[5,30,20],'C':[15,3,8]})
xdf =
A B C
0 10 5 15
1 20 30 3
2 30 20 8
Expected result:
xdf =
A B C Rk_1 Rk_2 Rk_3
0 10 5 15 C A B
1 20 30 3 B A C
2 30 20 8 A B C
OR
xdf =
A B C A_Rk B_Rk C_Rk
0 10 5 15 2 3 1
1 20 30 3 2 1 3
2 30 20 8 1 2 3
Why I need this:
I want to track the trend of each column and how it changes. I would like to show this with a plot, maybe a bar plot showing how many times A got rank 1, 2, 3, etc.
My approach:
xdf[['Rk_1','Rk_2','Rk_3']] = ""
for i in range(len(xdf)):
    xdf.loc[i, ['Rk_1','Rk_2','Rk_3']] = dict(sorted(dict(xdf[['A','B','C']].loc[i]).items(), reverse=True, key=lambda item: item[1])).keys()
Present output:
A B C Rk_1 Rk_2 Rk_3
0 10 5 15 C A B
1 20 30 3 B A C
2 30 20 8 A B C
I am iterating through each row, converting the row's columns into a dictionary, sorting by value, and then extracting the keys (columns). Is there a better approach? My actual dataframe has 10000 rows and 12 columns to be ranked; I just executed this and it took around 2 minutes.
You should be able to get your desired dataframe by using:
ranked = xdf.join(xdf.rank(ascending=False, method='first', axis=1), rsuffix='_rank')
This'll give you:
A B C A_rank B_rank C_rank
0 10 5 15 2.0 3.0 1.0
1 20 30 3 2.0 1.0 3.0
2 30 20 8 1.0 2.0 3.0
Then do whatever you need to do plotting wise.
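If you want the first layout instead (the ranked column names in Rk_1..Rk_3), a vectorized sketch using numpy's argsort, assuming ties can be broken by column order:

import numpy as np

cols = np.array(['A', 'B', 'C'])
# argsort of the negated values gives, per row, the column indices in descending order
order = np.argsort(-xdf[cols].to_numpy(), axis=1)
xdf[['Rk_1', 'Rk_2', 'Rk_3']] = cols[order]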

Adding and multiplying values of a dataframe in Python

I have a dataset with multiple columns and rows. The rows are supposed to be summed up based on the unique value in a column. I tried .groupby, but I want to retain the whole dataset, not just summed-up columns based on one unique column. I further need to multiply these individual columns (values) with another column.
For example:
id A B C D E
11 2 1 2 4 100
11 2 2 1 1 100
12 1 3 2 2 200
13 3 1 1 4 190
14 NaN 1 2 2 300
I would like to sum up columns B, C & D based on the unique id and then multiply the result by columns A and E into a new column F. I do not want to sum up the values of columns A & E.
I would like the resultant dataframe to be something like the one below, which also handles NaN by skipping it during the calculation:
id A B C D E F
11 2 3 3 5 100 9000
12 1 3 2 2 200 2400
13 3 1 1 4 190 2280
14 NaN 1 2 2 300 1200
If the above is unachievable, then I would like something like the following, where the rows stay the same but the calculation is as described above, based on the same id:
id A B C D E F
11 2 3 3 5 100 9000
11 2 2 1 1 100 9000
12 1 3 2 2 200 2400
13 3 1 1 4 190 2280
14 NaN 1 2 2 300 1200
My logic earlier was to apply groupby on columns B, C, D and then multiply, but that did not work out for me. If the above dataframes are unachievable, please let me know how I can perform this calculation and then merge/join the results with the original file with just the E column.
You must first sum columns B, C and D vertically for a common id, then take the horizontal product:
result = df.groupby('id').agg({'A': 'first', 'B': 'sum', 'C': 'sum', 'D': 'sum', 'E': 'first'})
result['F'] = result.fillna(1).astype('int64').agg('prod', axis=1)
It gives:
A B C D E F
id
11 2.0 3 3 5 100 9000
12 1.0 3 2 2 200 2400
13 3.0 1 1 4 190 2280
14 NaN 1 2 2 300 1200
Beware: id is the index here - use reset_index if you want it to be a normal column.
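If you prefer the second layout (keeping every original row), a sketch with GroupBy.transform, assuming A is a numeric column where NaN should act as 1:

# Per-row group sums of B, C and D, aligned with the original frame
sums = df.groupby('id')[['B', 'C', 'D']].transform('sum')
# Fill NaN in A with 1 so it does not poison the product
df['F'] = df['A'].fillna(1) * sums['B'] * sums['C'] * sums['D'] * df['E']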

pandas increment count column based on value of another column

I have a df that is the result of a join:
ID count
0 A 30
1 A 30
2 B 5
3 C 44
4 C 44
5 C 44
I would like to increment the count column for repeated values of the ID column. Here is an example of the desired result:
ID count
0 A 30
1 A 31
2 B 5
3 C 44
4 C 45
5 C 46
I know there are non-pythonic ways to do this via loops, but I am wondering if there is a more intelligent (and time-efficient, as this table is large) way to do this.
Use GroupBy.cumcount to get a per-group cumulative count and add it to count, e.g.:
df['count'] += df.groupby('ID')['count'].cumcount()
Gives you:
ID count
0 A 30
1 A 31
2 B 5
3 C 44
4 C 45
5 C 46
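To see what cumcount contributes, a minimal sketch rebuilding the frame from the question:

import pandas as pd

df = pd.DataFrame({'ID': ['A', 'A', 'B', 'C', 'C', 'C'],
                   'count': [30, 30, 5, 44, 44, 44]})

# cumcount numbers the rows within each ID group starting at 0
print(df.groupby('ID').cumcount().tolist())  # [0, 1, 0, 0, 1, 2]

df['count'] += df.groupby('ID')['count'].cumcount()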

Looking for a sequential pattern with condition

I have a df as
Id Event SeqNo
1 A 1
1 B 2
1 C 3
1 ABD 4
1 A 5
1 C 6
1 A 7
1 CDE 8
1 D 9
1 B 10
1 ABD 11
1 D 12
1 B 13
1 CDE 14
1 A 15
I am looking for the pattern "ABD followed by CDE without event B in between them".
For example, the output for this df would be:
Id Event SeqNo
1 ABD 4
1 A 5
1 C 6
1 A 7
1 CDE 8
This pattern can occur multiple times for a single Id, and I want to find the list of all those Ids and their respective counts (if possible).
Here's a vectorized one with some scaling trickery and leveraging convolution to find the required pattern -
# Get the col in context and scale it to the three strings to form an ID array
a = df['Event']
id_ar = (a=='ABD') + 2*(a=='B') + 3*(a=='CDE')
# Mask of those specific strings and hence extract the corresponding masked df
mask = id_ar > 0
df1 = df[mask].copy()
# Get pattern col with 1s at places with the pattern found, 0s elsewhere
df1['Pattern'] = (np.convolve(id_ar[mask],[9,1],'same')==28).astype(int)
# Groupby Id col and sum the pattern col for final output
out = df1.groupby(['Id'])['Pattern'].sum()
That convolution part might be a bit tricky. The idea is to use id_ar, which has values of 1, 2 and 3 corresponding to the strings 'ABD', 'B' and 'CDE'. We are looking for 1 followed by 3, so convolution with the kernel [9,1] results in 1*1 + 3*9 = 28 as the convolution sum for a window that has 'ABD' and then 'CDE'. Hence, we look for a conv. sum of 28 for a match. For the case of 'ABD' followed by 'B' and then 'CDE', the conv. sum would be different and hence be filtered out.
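To see the convolution trick in isolation, a tiny sketch on a hypothetical encoded sequence:

import numpy as np

# Filtered events ABD, B, CDE, ABD, CDE encoded as 1, 2, 3, 1, 3
ids = np.array([1, 2, 3, 1, 3])
# Each output element is 9*current + 1*previous
print(np.convolve(ids, [9, 1], 'same'))  # [ 9 19 29 12 28] -> 28 only where CDE directly follows ABD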
Sample run -
1) Input dataframe :
In [377]: df
Out[377]:
Id Event SeqNo
0 1 A 1
1 1 B 2
2 1 C 3
3 1 ABD 4
4 1 B 5
5 1 C 6
6 1 A 7
7 1 CDE 8
8 1 D 9
9 1 B 10
10 1 ABD 11
11 1 D 12
12 1 B 13
13 2 A 1
14 2 B 2
15 2 C 3
16 2 ABD 4
17 2 A 5
18 2 C 6
19 2 A 7
20 2 CDE 8
21 2 D 9
22 2 B 10
23 2 ABD 11
24 2 D 12
25 2 B 13
26 2 CDE 14
27 2 A 15
2) Intermediate filtered o/p (look at column Pattern for the presence of the reqd. pattern) :
In [380]: df1
Out[380]:
Id Event SeqNo Pattern
1 1 B 2 0
3 1 ABD 4 0
4 1 B 5 0
7 1 CDE 8 0
9 1 B 10 0
10 1 ABD 11 0
12 1 B 13 0
14 2 B 2 0
16 2 ABD 4 0
20 2 CDE 8 1
22 2 B 10 0
23 2 ABD 11 0
25 2 B 13 0
26 2 CDE 14 0
3) Final o/p :
In [381]: out
Out[381]:
Id
1 0
2 1
Name: Pattern, dtype: int64
I used a solution based on the assumption that anything other than ABD, CDE and B is irrelevant to our solution, so I get rid of those rows first with a filtering operation.
Then, what I want to know is whether there is an ABD followed by a CDE without a B in between. I shift the Event column by one step (note this doesn't have to be one step in units of SeqNo).
Then I check every row of the new df for whether Event == ABD and Event_1_Step == CDE, meaning there wasn't a B in between, but possibly other stuff like A or C or even nothing. This gets me a boolean for every such sequence; summing them up gives the count.
Finally, I have to make sure this is all done at Id level, so I use .groupby.
IMPORTANT: This solution assumes your df is sorted by Id first and then by SeqNo. If not, please sort it first.
import pandas as pd

df = pd.read_csv("path/to/file.csv")
df2 = df[df["Event"].isin(["ABD", "CDE", "B"])]
df2.loc[:, "Event_1_Step"] = df2["Event"].shift(-1)
df2.loc[:, "SeqNo_1_Step"] = df2["SeqNo"].shift(-1)

for id, id_df in df2.groupby("Id"):
    print(id)  # Set a counter object here per Id to track the count per Id
    id_df = id_df[id_df.apply(lambda x: x["Event"] == "ABD" and x["Event_1_Step"] == "CDE", axis=1)]
    for row_id, row in id_df.iterrows():
        print(df[(df["Id"] == id) & df["SeqNo"].between(row["SeqNo"], row["SeqNo_1_Step"])])
You could use this:
import numpy as np

s = (pd.Series(
         np.select([df['Event'] == 'ABD', df['Event'] == 'B', df['Id'] != df['Id'].shift()],
                   [True, False, False], default=np.nan))
     .ffill()
     .fillna(False)
     .astype(bool))

corr = (df['Event'] == "CDE") & s
corr.groupby(df['Id']).max()
np.select creates a column which is True where Event == 'ABD', and False for B or at the start of a new Id. Forward filling with ffill then tells you, for every row, whether ABD or B came last. You can then check whether it is True where the value is CDE, and use GroupBy to check whether it is True for any value per Id.
Which for
Id Event SeqNo
1 A 1
1 B 2
1 C 3
1 ABD 4
1 A 5
1 C 6
1 A 7
1 CDE 8
1 D 9
1 B 10
1 ABD 11
1 D 12
1 B 13
1 CDE 14
1 A 15
2 B 16
3 ABD 17
3 B 18
3 CDE 19
4 ABD 20
4 CDE 21
5 CDE 22
Outputs:
Id
1 True
2 False
3 False
4 True
5 False
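If you need a count rather than a boolean, one possible tweak is to sum instead of taking the max; note the semantics differ slightly from the convolution answer, since every CDE after an un-reset ABD is counted, not only the immediately following one:

corr.groupby(df['Id']).sum()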
