Histogram with custom y frequency in Python

I am trying to plot the following data
+----------+------+------+
| Duration | Code | Seq. |
+----------+------+------+
| 116.15   | 65   | 1    |
| 120.45   | 65   | 1    |
| 118.92   | 65   | 1    |
| 7.02     | 66   | 1    |
| 73.93    | 66   | 2    |
| 117.53   | 66   | 1    |
| 4.4      | 66   | 2    |
| 111.03   | 66   | 1    |
| 4.35     | 66   | 1    |
+----------+------+------+
I have my code as:
x1 = df.loc[df.Code==65, 'Duration']
x2 = df.loc[df.Code==66, 'Duration']
kwargs = dict(alpha=0.5, bins=10)
plt.hist(x1, **kwargs, color='k', label='Code 65')
plt.hist(x2, **kwargs, color='g', label='Code 66')
What I ideally want on the y axis is the number of Seq. corresponding to the different Durations on the x axis. But right now I only get the count of the Durations on y. How do I correct this?

You might bin the 'x' values using pandas and then use a bar chart instead.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'Duration':[116.15, 120.45,118.92,7.02,73.93, 117.53, 4.4, 111.03, 4.35]})
df['Code'] = [65,65,65,66,66,66,66,66,66]
df['Seq.'] = [1,1,1,1,2,1,2,1,1]
df
Duration Code Seq.
0 116.15 65 1
1 120.45 65 1
2 118.92 65 1
3 7.02 66 1
4 73.93 66 2
5 117.53 66 1
6 4.40 66 2
7 111.03 66 1
8 4.35 66 1
df['bin'] = pd.cut(df['Duration'],10, labels=False)
df
Duration Code Seq. bin
0 116.15 65 1 9
1 120.45 65 1 9
2 118.92 65 1 9
3 7.02 66 1 0
4 73.93 66 2 5
5 117.53 66 1 9
6 4.40 66 2 0
7 111.03 66 1 9
8 4.35 66 1 0
x1 = df.loc[df.Code==65, 'bin']
x2 = df.loc[df.Code==66, 'bin']
y1 = df.loc[df.Code==65, 'Seq.']
y2 = df.loc[df.Code==66, 'Seq.']
plt.bar(x1, y1)
plt.bar(x2, y2)
plt.show()
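As a hedged alternative (not part of the answer above): matplotlib's hist accepts a weights argument, so each Duration bin can show the sum of Seq. rather than the row count, which may be closer to what the question asks. A minimal sketch:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Duration': [116.15, 120.45, 118.92, 7.02, 73.93, 117.53, 4.4, 111.03, 4.35],
    'Code':     [65, 65, 65, 66, 66, 66, 66, 66, 66],
    'Seq.':     [1, 1, 1, 1, 2, 1, 2, 1, 1],
})

kwargs = dict(alpha=0.5, bins=10)
for code, color in [(65, 'k'), (66, 'g')]:
    sub = df[df.Code == code]
    # weights= makes each bar the sum of Seq. in that Duration bin,
    # instead of the number of rows falling in the bin
    plt.hist(sub['Duration'], weights=sub['Seq.'], color=color,
             label=f'Code {code}', **kwargs)
plt.legend()
plt.show()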

Related

How to obtain counts and sums for pairs of values in each row of Pandas DataFrame

Problem:
I have a DataFrame like so:
import pandas as pd
df = pd.DataFrame({
"name":["john","jim","eric","jim","john","jim","jim","eric","eric","john"],
"category":["a","b","c","b","a","b","c","c","a","c"],
"amount":[100,200,13,23,40,2,43,92,83,1]
})
name | category | amount
----------------------------
0 john | a | 100
1 jim | b | 200
2 eric | c | 13
3 jim | b | 23
4 john | a | 40
5 jim | b | 2
6 jim | c | 43
7 eric | c | 92
8 eric | a | 83
9 john | c | 1
I would like to add two new columns: first; the total amount for the relevant category for the name of the row (eg: the value in row 0 would be 140, because john has a total of 100 + 40 of the a category). Second; the counts of those name and category combinations which are being summed in the first new column (eg: the row 0 value would be 2).
Desired output:
The output I'm looking for here looks like this:
name | category | amount | sum_for_category | count_for_category
------------------------------------------------------------------------
0 john | a | 100 | 140 | 2
1 jim | b | 200 | 225 | 3
2 eric | c | 13 | 105 | 2
3 jim | b | 23 | 225 | 3
4 john | a | 40 | 140 | 2
5 jim | b | 2 | 225 | 3
6 jim | c | 43 | 43 | 1
7 eric | c | 92 | 105 | 2
8 eric | a | 83 | 83 | 1
9 john | c | 1 | 1 | 1
I don't want to group the data by the features because I want to keep the same number of rows. I just want to tag on the desired value for each row.
Best I could do:
I can't find a good way to do this. The best I've been able to come up with is the following:
names = df["name"].unique()
categories = df["category"].unique()
sum_for_category = {i:{
j:df.loc[(df["name"]==i)&(df["category"]==j)]["amount"].sum() for j in categories
} for i in names}
df["sum_for_category"] = df.apply(lambda x: sum_for_category[x["name"]][x["category"]],axis=1)
count_for_category = {i:{
j:df.loc[(df["name"]==i)&(df["category"]==j)]["amount"].count() for j in categories
} for i in names}
df["count_for_category"] = df.apply(lambda x: count_for_category[x["name"]][x["category"]],axis=1)
But this is extremely clunky and slow; far too slow to be viable on my actual dataset (roughly 700,000 rows x 10 columns). I'm sure there's a better and faster way to do this... Many thanks in advance.
You need two groupby.transform calls:
g = df.groupby(['name', 'category'])['amount']
df['sum_for_category'] = g.transform('sum')
df['count_for_category'] = g.transform('size')
output:
name category amount sum_for_category count_for_category
0 john a 100 140 2
1 jim b 200 225 3
2 eric c 13 105 2
3 jim b 23 225 3
4 john a 40 140 2
5 jim b 2 225 3
6 jim c 43 43 1
7 eric c 92 105 2
8 eric a 83 83 1
9 john c 1 1 1
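One hedged note, not from the answer: transform('size') counts every row in the group, while transform('count') counts only non-null amount values; with no NaNs in this data they agree. A tiny illustration (the extra row is hypothetical, added only to show the difference):
import pandas as pd

df2 = pd.concat(
    [df, pd.DataFrame({"name": ["john"], "category": ["a"], "amount": [float("nan")]})],
    ignore_index=True)
g2 = df2.groupby(["name", "category"])["amount"]
df2["size_for_category"] = g2.transform("size")    # counts every row: 3 for (john, a)
df2["count_for_category"] = g2.transform("count")  # counts non-null amounts: 2 for (john, a)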
Another possible solution:
g = df.groupby(['name', 'category']).amount.agg(['sum','count']).reset_index()
df.merge(g, on = ['name', 'category'], how = 'left')
Output:
name category amount sum count
0 john a 100 140 2
1 jim b 200 225 3
2 eric c 13 105 2
3 jim b 23 225 3
4 john a 40 140 2
5 jim b 2 225 3
6 jim c 43 43 1
7 eric c 92 105 2
8 eric a 83 83 1
9 john c 1 1 1
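If the merged columns should carry the names asked for in the question, a small follow-up sketch (an assumption on top of the answer above; note that merge returns a new frame, so it needs assigning back):
g = df.groupby(['name', 'category']).amount.agg(['sum', 'count']).reset_index()
df = df.merge(
    g.rename(columns={'sum': 'sum_for_category', 'count': 'count_for_category'}),
    on=['name', 'category'], how='left')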
import pandas as pd
df = pd.DataFrame({
"name":["john","jim","eric","jim","john","jim","jim","eric","eric","john"],
"category":["a","b","c","b","a","b","c","c","a","c"],
"amount":[100,200,13,23,40,2,43,92,83,1]
})
df_Count = df.groupby(['name','category']).count().reset_index().rename({'amount':'Count_For_Category'}, axis=1)
df_Sum = df.groupby(['name','category']).sum().reset_index().rename({'amount':'Sum_For_Category'},axis=1)
df_v2 = pd.merge(df,df_Count[['name','category','Count_For_Category']], left_on=['name','category'], right_on=['name','category'], how='left')
df_v2 = pd.merge(df_v2,df_Sum[['name','category','Sum_For_Category']], left_on=['name','category'], right_on=['name','category'], how='left')
df_v2
Hi there,
Here is some simple, easy-to-understand code; just run the code above and you will get what you want.
Thanks,
Leon

Pandas - Rolling average for a group across multiple columns; large dataframe

I have the following dataframe:
+-----+-----+-------------+-------------+-------------------------+
| ID1 | ID2 | Box1_weight | Box2_weight | Average Prev Weight ID1 |
+-----+-----+-------------+-------------+-------------------------+
| 19  | 677 | 3           | 2           | -                       |
| 677 | 19  | 1           | 0           | 2                       |
| 19  | 677 | 3           | 1           | (0+3)/2=1.5             |
| 19  | 677 | 7           | 0           | (3+0+3)/3=2             |
| 677 | 19  | 1           | 3           | (0+1+1)/3≈0.67          |
+-----+-----+-------------+-------------+-------------------------+
I want to work out the moving average of the weight of the past 3 boxes, based on ID. I want to do this for all IDs in ID1.
I have put the column I want to calculate, along with the calculations, in the table above, labelled "Average Prev Weight ID1".
I can get a rolling average for each individual column using the following:
df_copy.groupby('ID1')['Box1_weight'].apply(lambda x: x.shift().rolling(period_length, min_periods=1).mean())
However, this does not take into account that the item may also have been packed in the column labelled "Box2_weight"
How can I get a rolling average that is per ID, across the two columns?
Any guidance is appreciated.
Here is my attempt
Stack the two id columns and the two weight columns to create a dataframe with a single ids column and a single weights column. Calculate the running average, then assign the running average for ID1 back to the dataframe.
I have used your code for calculating the rolling average, but I arranged the data into df2 before doing it.
import pandas as pd
d = {
"ID1": [19,677,19,19,677],
"ID2": [677, 19, 677,677, 19],
"Box1_weight": [3,1,3,7,1],
"Box2_weight": [2,0,1,0,3]
}
df = pd.DataFrame(d)
display(df)
period_length=3
ids = df[["ID1", "ID2"]].stack().values
weights = df[["Box1_weight", "Box2_weight"]].stack().values
df2=pd.DataFrame(dict(ids=ids, weights=weights))
rolling_avg = df2.groupby("ids")["weights"] \
.apply(lambda x: x.shift().rolling(period_length, min_periods=1)
.mean()).values.reshape(-1,2)
df["rolling_avg"] = rolling_avg[:,0]
display(df)
Result
ID1 ID2 Box1_weight Box2_weight
0 19 677 3 2
1 677 19 1 0
2 19 677 3 1
3 19 677 7 0
4 677 19 1 3
ID1 ID2 Box1_weight Box2_weight rolling_avg
0 19 677 3 2 NaN
1 677 19 1 0 2.000000
2 19 677 3 1 1.500000
3 19 677 7 0 2.000000
4 677 19 1 3 0.666667
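For reference, a sketch of the same reshape done with pd.concat instead of stack (my own addition, not from the answer above; it assumes the df and period_length defined earlier and reproduces the same rolling_avg values):
# put both (id, weight) pairs into one long frame, keeping the original row
# order: for each original row, the ID1/Box1 pair comes before the ID2/Box2 pair
long_df = pd.concat(
    [df[['ID1', 'Box1_weight']].rename(columns={'ID1': 'id', 'Box1_weight': 'weight'}),
     df[['ID2', 'Box2_weight']].rename(columns={'ID2': 'id', 'Box2_weight': 'weight'})],
    keys=['first', 'second']).sort_index(level=1)

# rolling mean of the previous `period_length` weights, per id
long_df['avg'] = long_df.groupby('id')['weight'].transform(
    lambda s: s.shift().rolling(period_length, min_periods=1).mean())

# keep the rows corresponding to ID1 and align them back by the original index
df['rolling_avg'] = long_df.xs('first')['avg']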
Not sure if this is what you want. I had trouble understanding your requirements. But here's a go:
ids = ['ID1', 'ID2']
ind = np.argsort(df[ids].to_numpy(), 1)
make_sort = lambda s, ind: np.take_along_axis(s, ind, axis=1)
f = make_sort(df[ids].to_numpy(), ind)
s = make_sort(df[['Box1_weight', 'Box2_weight']].to_numpy(), ind)
df2 = pd.DataFrame(np.concatenate([f,s], 1), columns=df.columns)
res1 = df2.groupby('ID1').Box1_weight.rolling(3, min_periods=1).mean().shift()
res2 = df2.groupby('ID2').Box2_weight.rolling(3, min_periods=1).mean().shift()
means = pd.concat([res1,res2], 1).rename(columns={'Box1_weight': 'w1', 'Box2_weight': 'w2'})
x = df.set_index([df.ID1.values, df.index])
final = x[ids].merge(means, left_index=True, right_index=True)[['w1','w2']].sum(1).sort_index(level=1)
df['final_weight'] = final.tolist()
ID1 ID2 Box1_weight Box2_weight final_weight
0 19 677 3 2 0.000000
1 677 19 1 0 2.000000
2 19 677 3 1 1.500000
3 19 677 7 0 2.000000
4 677 19 1 3 0.666667

Plots shifting in heatmaps in Seaborn Facetgrid

Sorry in advance for the number of images, but they help demonstrate the issue.
I have built a dataframe which contains film thickness measurements, for a number of substrates, for a number of layers, as function of coordinates:
| | Sub | Result | Layer | Row | Col |
|----|-----|--------|-------|-----|-----|
| 0 | 1 | 2.95 | 3 - H | 0 | 72 |
| 1 | 1 | 2.97 | 3 - V | 0 | 72 |
| 2 | 1 | 0.96 | 1 - H | 0 | 72 |
| 3 | 1 | 3.03 | 3 - H | -42 | 48 |
| 4 | 1 | 3.04 | 3 - V | -42 | 48 |
| 5 | 1 | 1.06 | 1 - H | -42 | 48 |
| 6 | 1 | 3.06 | 3 - H | 42 | 48 |
| 7 | 1 | 3.09 | 3 - V | 42 | 48 |
| 8 | 1 | 1.38 | 1 - H | 42 | 48 |
| 9 | 1 | 3.05 | 3 - H | -21 | 24 |
| 10 | 1 | 3.08 | 3 - V | -21 | 24 |
| 11 | 1 | 1.07 | 1 - H | -21 | 24 |
| 12 | 1 | 3.06 | 3 - H | 21 | 24 |
| 13 | 1 | 3.09 | 3 - V | 21 | 24 |
| 14 | 1 | 1.05 | 1 - H | 21 | 24 |
| 15 | 1 | 3.01 | 3 - H | -63 | 0 |
| 16 | 1 | 3.02 | 3 - V | -63 | 0 |
and this continues for >10 subs (per batch), and 13 sites per sub, and for 3 layers - this df is a composite.
I am attempting to present the data as a facetgrid of heatmaps (adapting code from How to make heatmap square in Seaborn FacetGrid - thanks!)
I can plot a subset of the df quite happily:
spam = df.loc[df.Sub== 6].loc[df.Layer == '3 - H']
spam_p= spam.pivot(index='Row', columns='Col', values='Result')
sns.heatmap(spam_p, cmap="plasma")
BUT - there are some missing results, where the layer measurement errors out (returning '10000'), so I've replaced these with NaNs:
df.Result = df.Result.replace(10000, np.nan)
To plot a facetgrid to show all subs/layers, I've written the following code:
def draw_heatmap(*args, **kwargs):
    data = kwargs.pop('data')
    d = data.pivot(columns=args[0], index=args[1], values=args[2])
    sns.heatmap(d, **kwargs)
fig = sns.FacetGrid(spam, row='Wafer',
col='Feature', height=5, aspect=1)
fig.map_dataframe(draw_heatmap, 'Col', 'Row', 'Result', cbar=False, cmap="plasma", annot=True, annot_kws={"size": 20})
which yields:
It has automatically adjusted axes to not show any positions where there is a NaN.
I have tried masking (see https://github.com/mwaskom/seaborn/issues/375) but just errors out with Inconsistent shape between the condition and the input (got (237, 15) and (7, 7)).
And the result of this is, when not using the cropped-down dataset (i.e. df instead of spam), the code generates the following FacetGrid:
Plots featuring missing values at extreme (edge) coordinate positions make the plot shift within the axes - here all apparently to the upper left. Sub #5, layer 3-H should look like:
i.e. blanks in the places where there are NaNs.
Why is the facetgrid shifting the entire plot up and/or left? The alternative is dynamically generating subplots based on a sub/layer-count (ugh!).
Any help very gratefully received.
Full dataset for 2 layers of sub 5:
Sub Result Layer Row Col
0 5 2.987 3 - H 0 72
1 5 0.001 1 - H 0 72
2 5 1.184 3 - H -42 48
3 5 1.023 1 - H -42 48
4 5 3.045 3 - H 42 48
5 5 0.282 1 - H 42 48
6 5 3.083 3 - H -21 24
7 5 0.34 1 - H -21 24
8 5 3.07 3 - H 21 24
9 5 0.41 1 - H 21 24
10 5 NaN 3 - H -63 0
11 5 NaN 1 - H -63 0
12 5 3.086 3 - H 0 0
13 5 0.309 1 - H 0 0
14 5 0.179 3 - H 63 0
15 5 0.455 1 - H 63 0
16 5 3.067 3 - H -21 -24
17 5 0.136 1 - H -21 -24
18 5 1.907 3 - H 21 -24
19 5 1.018 1 - H 21 -24
20 5 NaN 3 - H -42 -48
21 5 NaN 1 - H -42 -48
22 5 NaN 3 - H 42 -48
23 5 NaN 1 - H 42 -48
24 5 NaN 3 - H 0 -72
25 5 NaN 1 - H 0 -72
You may create a list of unique column and row labels and reindex the pivot table with them.
cols = df["Col"].unique()
rows = df["Row"].unique()
pivot = data.pivot(...).reindex_axis(cols, axis=1).reindex_axis(rows, axis=0)
as seen in this answer.
Some complete code:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
r = np.repeat([0,-2,2,-1,1,-3],2)
row = np.concatenate((r, [0]*2, -r[::-1]))
c = np.array([72]*2+[48]*4 + [24]*4 + [0]* 3)
col = np.concatenate((c,-c[::-1]))
df = pd.DataFrame({"Result" : np.random.rand(26),
"Layer" : list("AB")*13,
"Row" : row, "Col" : col})
df1 = df.copy()
df1["Sub"] = [5]*len(df1)
df1.at[10:11,"Result"] = np.NaN
df1.at[20:,"Result"] = np.NaN
df2 = df.copy()
df2["Sub"] = [3]*len(df2)
df2.at[0:2,"Result"] = np.NaN
df = pd.concat([df1,df2])
cols = np.unique(df["Col"].values)
rows = np.unique(df["Row"].values)
def draw_heatmap(*args, **kwargs):
    data = kwargs.pop('data')
    d = data.pivot(columns=args[0], index=args[1], values=args[2])
    d = d.reindex_axis(cols, axis=1).reindex_axis(rows, axis=0)
    print(d)
    sns.heatmap(d, **kwargs)
grid = sns.FacetGrid(df, row='Sub', col='Layer', height=3.5, aspect=1 )
grid.map_dataframe(draw_heatmap, 'Col', 'Row', 'Result', cbar=False,
cmap="plasma", annot=True)
plt.show()
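A hedged note for newer pandas (not part of the original answer): reindex_axis was removed in pandas 1.0, so inside draw_heatmap the same reindexing step can be written with reindex instead:
# equivalent to the two reindex_axis calls above
d = d.reindex(columns=cols, index=rows)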

Keep pandas dataframe columns and their order in pivot table

I have a dataframe:
df = pd.DataFrame({'No': [123,123,123,523,523,523,765],
'Type': ['A','B','C','A','C','D','A'],
'Task': ['First','Second','First','Second','Third','First','Fifth'],
'Color': ['blue','red','blue','black','red','red','red'],
'Price': [10,5,1,12,12,12,18],
'Unit': ['E','E','E','E','E','E','E'],
'Pers.ID': [45,6,6,43,1,9,2]
})
So it looks like this:
df
+-----+------+--------+-------+-------+------+---------+
| No | Type | Task | Color | Price | Unit | Pers.ID |
+-----+------+--------+-------+-------+------+---------+
| 123 | A | First | blue | 10 | E | 45 |
| 123 | B | Second | red | 5 | E | 6 |
| 123 | C | First | blue | 1 | E | 6 |
| 523 | A | Second | black | 12 | E | 43 |
| 523 | C | Third | red | 12 | E | 1 |
| 523 | D | First | red | 12 | E | 9 |
| 765 | A | First | red | 18 | E | 2 |
+-----+------+--------+-------+-------+------+---------+
then I created a pivot table:
piv = pd.pivot_table(df, index=['No','Type','Task'])
Result:
Pers.ID Price
No Type Task
123 A First 45 10
B Second 6 5
C First 6 1
523 A Second 43 12
C Third 1 12
D First 9 12
765 A Fifth 2 18
As you can see, problems are:
multiple columns are gone (Color and Unit)
The order of the columns Price and Pers.ID is not the same as in the original dataframe.
I tried to fix this by executing:
cols = list(df.columns)
piv = pd.pivot_table(df, index=['No','Type','Task'], values = cols)
but the result is the same.
I read other posts but none of them matched my problem in a way that I could use it.
Thank you!
EDIT: desired output
Color Price Unit Pers.ID
No Type Task
123 A First blue 10 E 45
B Second red 5 E 6
C First blue 1 E 6
523 A Second black 12 E 43
C Third red 12 E 1
D First red 12 E 9
765 A Fifth red 18 E 2
I think the problem is that pivot_table's default aggregation function is mean, so string columns are excluded. A custom function is needed; the column order is also changed, so reindex is necessary:
f = lambda x: x.sum() if np.issubdtype(x.dtype, np.number) else ', '.join(x)
cols = df.columns[~df.columns.isin(['No','Type','Task'])].tolist()
piv = (pd.pivot_table(df,
index=['No','Type','Task'],
values = cols,
aggfunc=f).reindex(columns=cols))
print (piv)
Color Price Unit Pers.ID
No Type Task
123 A First blue 10 E 45
B Second red 5 E 6
C First blue 1 E 6
523 A Second black 12 E 43
C Third red 12 E 1
D First red 12 E 9
765 A Fifth red 18 E 2
Another solution with groupby and the same aggregation function; ordering is not a problem:
df = (df.groupby(['No','Type','Task'])
.agg(lambda x: x.sum() if np.issubdtype(x.dtype, np.number) else ', '.join(x)))
print (df)
Color Price Unit Pers.ID
No Type Task
123 A First blue 10 E 45
B Second red 5 E 6
C First blue 1 E 6
523 A Second black 12 E 43
C Third red 12 E 1
D First red 12 E 9
765 A Fifth red 18 E 2
But if you only need to set the first 3 columns as a MultiIndex:
df = df.set_index(['No','Type','Task'])
print (df)
Color Price Unit Pers.ID
No Type Task
123 A First blue 10 E 45
B Second red 5 E 6
C First blue 1 E 6
523 A Second black 12 E 43
C Third red 12 E 1
D First red 12 E 9
765 A Fifth red 18 E 2
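A further hedged sketch (an assumption, not part of the answers above): pivot_table also accepts a per-column dict of aggregation functions, which keeps the string columns without the dtype check; 'first' is safe here because each (No, Type, Task) group has a single row. Using the original df from the question:
cols = ['Color', 'Price', 'Unit', 'Pers.ID']
piv = (pd.pivot_table(df, index=['No', 'Type', 'Task'],
                      aggfunc={'Color': 'first', 'Price': 'sum',
                               'Unit': 'first', 'Pers.ID': 'sum'})
         .reindex(columns=cols))
print(piv)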

Sort within a group and add a columns indicating rows below and above

I have a pandas dataframe that contains something like
+------+--------+-----+-------+
| Team | Gender | Age | Name |
+------+--------+-----+-------+
| A | M | 22 | Sam |
| A | F | 25 | Annie |
| B | M | 33 | Fred |
| B | M | 18 | James |
| A | M | 56 | Alan |
| B | F | 28 | Julie |
| A | M | 33 | Greg |
+------+--------+-----+-------+
What I'm trying to do is first group by Team and Gender (which I have been able to do using df.groupby(['Team'], as_index=False)).
Is there a way to sort the members of the group based on their age and add extra columns in there which would indicate how many members are above any particular member and how many below?
eg:
For group 'Team A':
+------+--------+-----+-------+---------+---------+---------+---------+
| Team | Gender | Age | Name | M_Above | M_Below | F_Above | F_Below |
+------+--------+-----+-------+---------+---------+---------+---------+
| A | M | 22 | Sam | 0 | 2 | 0 | 1 |
| A | F | 25 | Annie | 1 | 2 | 0 | 0 |
| A | M | 33 | Greg | 1 | 1 | 1 | 0 |
| A | M | 56 | Alan | 2 | 0 | 1 | 0 |
+------+--------+-----+-------+---------+---------+---------+---------+
import pandas as pd
df = pd.DataFrame({'Team':['A','A','B','B','A','B','A'], 'Gender':['M','F','M','M','M','F','M'],
'Age':[22,25,33,18,56,28,33], 'Name':['Sam','Annie','Fred','James','Alan','Julie','Greg']}).sort_values(['Team','Age'])
for idx, data in df.groupby(['Team'], as_index=False):
    m_tot = data['Gender'].value_counts()[0]  # number of males in current team
    f_tot = data['Gender'].value_counts()[1]  # ditto (females)
    m_seen = 0  # males seen so far for current team
    f_seen = 0  # ditto (females)
    for row in data.iterrows():
        (M_Above, M_below, F_Above, F_Below) = (m_seen, m_tot-m_seen, f_seen, f_tot-f_seen)
        if row[1].Gender == 'M':
            m_seen += 1
            M_below -= 1
        else:
            f_seen += 1
            F_Below -= 1
        df.loc[row[0],'M_Above'] = M_Above
        df.loc[row[0],'M_Below'] = M_below
        df.loc[row[0],'F_Above'] = F_Above
        df.loc[row[0],'F_Below'] = F_Below
And it results as:
Age Gender Team M_Above M_below F_Above F_Below
0 22 M A 0.0 2.0 0.0 1.0
1 25 F A 1.0 2.0 0.0 0.0
6 33 M A 1.0 1.0 1.0 0.0
4 56 M A 2.0 0.0 1.0 0.0
3 18 M B 0.0 1.0 0.0 1.0
5 28 F B 1.0 1.0 0.0 0.0
2 33 M B 1.0 0.0 1.0 0.0
And if you wish to get the new columns as int (as in your example), use:
for new_col in ['M_Above', 'M_Below', 'F_Above', 'F_Below']:
    df[new_col] = df[new_col].astype(int)
Which results:
Age Gender Name Team M_Above M_Below F_Above F_Below
0 22 M Sam A 0 2 0 1
1 25 F Annie A 1 2 0 0
6 33 M Greg A 1 1 1 0
4 56 M Alan A 2 0 1 0
3 18 M James B 0 1 0 1
5 28 F Julie B 1 1 0 0
2 33 M Fred B 1 0 1 0
EDIT: (running-times comparison)
Note that this solution is faster than using ix (the approved solution). The average running time (over 1000 iterations) is roughly 4 times lower (which would probably matter on bigger DataFrames). Run this to check:
import pandas as pd
from time import time
import numpy as np
def f(x):
    for i,d in x.iterrows():
        above = x.ix[:i, 'Gender'].drop(i).value_counts().reindex(['M','F'])
        below = x.ix[i:, 'Gender'].drop(i).value_counts().reindex(['M','F'])
        x.ix[i,'M_Above'] = above.ix['M']
        x.ix[i,'M_Below'] = below.ix['M']
        x.ix[i,'F_Above'] = above.ix['F']
        x.ix[i,'F_Below'] = below.ix['F']
    return x
df = pd.DataFrame({'Team':['A','A','B','B','A','B','A'], 'Gender':['M','F','M','M','M','F','M'],
'Age':[22,25,33,18,56,28,33], 'Name':['Sam','Annie','Fred','James','Alan','Julie','Greg']}).sort_values(['Team','Age'])
times = []
times2 = []
for i in range(1000):
    tic = time()
    for idx, data in df.groupby(['Team'], as_index=False):
        m_tot = data['Gender'].value_counts()[0]  # number of males in current team
        f_tot = data['Gender'].value_counts()[1]  # ditto (females)
        m_seen = 0  # males seen so far for current team
        f_seen = 0  # ditto (females)
        for row in data.iterrows():
            (M_Above, M_below, F_Above, F_Below) = (m_seen, m_tot-m_seen, f_seen, f_tot-f_seen)
            if row[1].Gender == 'M':
                m_seen += 1
                M_below -= 1
            else:
                f_seen += 1
                F_Below -= 1
            df.loc[row[0],'M_Above'] = M_Above
            df.loc[row[0],'M_Below'] = M_below
            df.loc[row[0],'F_Above'] = F_Above
            df.loc[row[0],'F_Below'] = F_Below
    toc = time()
    times.append(toc-tic)
for i in range(1000):
    tic = time()
    df1 = df.groupby('Team', sort=False).apply(f).fillna(0)
    df1.ix[:,'M_Above':] = df1.ix[:,'M_Above':].astype(int)
    toc = time()
    times2.append(toc-tic)
print(np.mean(times))
print(np.mean(times2))
Results:
0.0163134906292 # alternative solution
0.0622982912064 # approved solution
You can apply a custom function f with groupby on the Team column.
In f, for each row, first filter the above and below values with ix, then drop the row's own value and get the desired counts with value_counts. Some values are missing, so a reindex is needed before selecting with ix:
def f(x):
    for i,d in x.iterrows():
        above = x.ix[:i, 'Gender'].drop(i).value_counts().reindex(['M','F'])
        below = x.ix[i:, 'Gender'].drop(i).value_counts().reindex(['M','F'])
        x.ix[i,'M_Above'] = above.ix['M']
        x.ix[i,'M_Below'] = below.ix['M']
        x.ix[i,'F_Above'] = above.ix['F']
        x.ix[i,'F_Below'] = below.ix['F']
    return x
df1 = df.groupby('Team', sort=False).apply(f).fillna(0)
#cast float to int
df1.ix[:,'M_Above':] = df1.ix[:,'M_Above':].astype(int)
print (df1)
Age Gender Name Team M_Above M_Below F_Above F_Below
0 22 M Sam A 0 2 0 1
1 25 F Annie A 1 2 0 0
6 33 M Greg A 1 1 1 0
4 56 M Alan A 2 0 1 0
3 18 M James B 0 1 0 1
5 28 F Julie B 1 1 0 0
2 33 M Fred B 1 0 1 0
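For completeness, a vectorized sketch with cumulative sums (my own addition, not from the answers above; it assumes the df defined in the first answer, sorted by Team and Age):
df = df.sort_values(['Team', 'Age'])
is_m = (df['Gender'] == 'M').astype(int)
is_f = 1 - is_m

# males/females earlier in the same team (exclusive of the current row)
df['M_Above'] = is_m.groupby(df['Team']).cumsum() - is_m
df['F_Above'] = is_f.groupby(df['Team']).cumsum() - is_f
# males/females later in the same team
df['M_Below'] = is_m.groupby(df['Team']).transform('sum') - is_m.groupby(df['Team']).cumsum()
df['F_Below'] = is_f.groupby(df['Team']).transform('sum') - is_f.groupby(df['Team']).cumsum()
print(df)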
