Keep pandas dataframe columns and their order in pivot table - python

I have a dataframe:
import pandas as pd

df = pd.DataFrame({'No': [123, 123, 123, 523, 523, 523, 765],
                   'Type': ['A', 'B', 'C', 'A', 'C', 'D', 'A'],
                   'Task': ['First', 'Second', 'First', 'Second', 'Third', 'First', 'Fifth'],
                   'Color': ['blue', 'red', 'blue', 'black', 'red', 'red', 'red'],
                   'Price': [10, 5, 1, 12, 12, 12, 18],
                   'Unit': ['E', 'E', 'E', 'E', 'E', 'E', 'E'],
                   'Pers.ID': [45, 6, 6, 43, 1, 9, 2]
                   })
So it looks like this:
df
+-----+------+--------+-------+-------+------+---------+
| No  | Type | Task   | Color | Price | Unit | Pers.ID |
+-----+------+--------+-------+-------+------+---------+
| 123 | A    | First  | blue  |    10 | E    |      45 |
| 123 | B    | Second | red   |     5 | E    |       6 |
| 123 | C    | First  | blue  |     1 | E    |       6 |
| 523 | A    | Second | black |    12 | E    |      43 |
| 523 | C    | Third  | red   |    12 | E    |       1 |
| 523 | D    | First  | red   |    12 | E    |       9 |
| 765 | A    | Fifth  | red   |    18 | E    |       2 |
+-----+------+--------+-------+-------+------+---------+
then I created a pivot table:
piv = pd.pivot_table(df, index=['No','Type','Task'])
Result:
                Pers.ID  Price
No  Type Task
123 A    First       45     10
    B    Second       6      5
    C    First        6      1
523 A    Second      43     12
    C    Third        1     12
    D    First        9     12
765 A    Fifth        2     18
As you can see, the problems are:
- multiple columns are gone (Color and Unit)
- the order of the columns Price and Pers.ID is not the same as in the original dataframe
I tried to fix this by executing:
cols = list(df.columns)
piv = pd.pivot_table(df, index=['No','Type','Task'], values = cols)
but the result is the same.
I read other posts but none of them matched my problem in a way that I could use it.
Thank you!
EDIT: desired output
                 Color  Price  Unit  Pers.ID
No  Type Task
123 A    First    blue     10     E       45
    B    Second    red      5     E        6
    C    First    blue      1     E        6
523 A    Second  black     12     E       43
    C    Third     red     12     E        1
    D    First     red     12     E        9
765 A    Fifth     red     18     E        2

I think the problem is that pivot_table's default aggregation function is mean, so string columns are excluded. You need a custom aggregation function; the column order also changes, so a reindex is necessary:
import numpy as np

f = lambda x: x.sum() if np.issubdtype(x.dtype, np.number) else ', '.join(x)
cols = df.columns[~df.columns.isin(['No','Type','Task'])].tolist()
piv = (pd.pivot_table(df,
                      index=['No','Type','Task'],
                      values=cols,
                      aggfunc=f)
         .reindex(columns=cols))
print (piv)
                 Color  Price  Unit  Pers.ID
No  Type Task
123 A    First    blue     10     E       45
    B    Second    red      5     E        6
    C    First    blue      1     E        6
523 A    Second  black     12     E       43
    C    Third     red     12     E        1
    D    First     red     12     E        9
765 A    Fifth     red     18     E        2
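If every (No, Type, Task) combination is unique, as in the sample data, a simpler variant (a sketch of my own, not part of the original answer) is to aggregate with 'first', which keeps the string columns as well; the reindex is still needed to restore the original column order:
# Sketch: assumes each (No, Type, Task) group contains a single row, as in the sample data.
cols = df.columns[~df.columns.isin(['No','Type','Task'])].tolist()
piv = (pd.pivot_table(df,
                      index=['No','Type','Task'],
                      values=cols,
                      aggfunc='first')      # 'first' keeps non-numeric columns too
         .reindex(columns=cols))            # restore the original column order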
Another solution uses groupby with the same aggregation function; there the ordering is not a problem:
df = (df.groupby(['No','Type','Task'])
        .agg(lambda x: x.sum() if np.issubdtype(x.dtype, np.number) else ', '.join(x)))
print (df)
                 Color  Price  Unit  Pers.ID
No  Type Task
123 A    First    blue     10     E       45
    B    Second    red      5     E        6
    C    First    blue      1     E        6
523 A    Second  black     12     E       43
    C    Third     red     12     E        1
    D    First     red     12     E        9
765 A    Fifth     red     18     E        2
But if you only need to set the first 3 columns of the original DataFrame as a MultiIndex (no aggregation, since each (No, Type, Task) combination is already unique):
df = df.set_index(['No','Type','Task'])
print (df)
                 Color  Price  Unit  Pers.ID
No  Type Task
123 A    First    blue     10     E       45
    B    Second    red      5     E        6
    C    First    blue      1     E        6
523 A    Second  black     12     E       43
    C    Third     red     12     E        1
    D    First     red     12     E        9
765 A    Fifth     red     18     E        2

Related

How to obtain counts and sums for pairs of values in each row of Pandas DataFrame

Problem:
I have a DataFrame like so:
import pandas as pd
df = pd.DataFrame({
    "name": ["john","jim","eric","jim","john","jim","jim","eric","eric","john"],
    "category": ["a","b","c","b","a","b","c","c","a","c"],
    "amount": [100,200,13,23,40,2,43,92,83,1]
})
name | category | amount
----------------------------
0 john | a | 100
1 jim | b | 200
2 eric | c | 13
3 jim | b | 23
4 john | a | 40
5 jim | b | 2
6 jim | c | 43
7 eric | c | 92
8 eric | a | 83
9 john | c | 1
I would like to add two new columns: first, the total amount for the relevant category for the name of the row (e.g. the value in row 0 would be 140, because john has a total of 100 + 40 in the a category); second, the count of those name and category combinations which are being summed in the first new column (e.g. the row 0 value would be 2).
Desired output:
The output I'm looking for here looks like this:
name | category | amount | sum_for_category | count_for_category
------------------------------------------------------------------------
0 john | a | 100 | 140 | 2
1 jim | b | 200 | 225 | 3
2 eric | c | 13 | 105 | 2
3 jim | b | 23 | 225 | 3
4 john | a | 40 | 140 | 2
5 jim | b | 2 | 225 | 3
6 jim | c | 43 | 43 | 1
7 eric | c | 92 | 105 | 2
8 eric | a | 83 | 83 | 1
9 john | c | 1 | 1 | 1
I don't want to group the data by the features because I want to keep the same number of rows. I just want to tag on the desired value for each row.
Best I could do:
I can't find a good way to do this. The best I've been able to come up with is the following:
names = df["name"].unique()
categories = df["category"].unique()
sum_for_category = {i: {
    j: df.loc[(df["name"] == i) & (df["category"] == j)]["amount"].sum() for j in categories
} for i in names}
df["sum_for_category"] = df.apply(lambda x: sum_for_category[x["name"]][x["category"]], axis=1)

count_for_category = {i: {
    j: df.loc[(df["name"] == i) & (df["category"] == j)]["amount"].count() for j in categories
} for i in names}
df["count_for_category"] = df.apply(lambda x: count_for_category[x["name"]][x["category"]], axis=1)
But this is extremely clunky and slow; far too slow to be viable on my actual dataset (roughly 700,000 rows x 10 columns). I'm sure there's a better and faster way to do this... Many thanks in advance.
You need two groupby.transform:
g = df.groupby(['name', 'category'])['amount']
df['sum_for_category'] = g.transform('sum')
df['count_for_category'] = g.transform('size')
output:
name category amount sum_for_category count_for_category
0 john a 100 140 2
1 jim b 200 225 3
2 eric c 13 105 2
3 jim b 23 225 3
4 john a 40 140 2
5 jim b 2 225 3
6 jim c 43 43 1
7 eric c 92 105 2
8 eric a 83 83 1
9 john c 1 1 1
Another possible solution:
g = df.groupby(['name', 'category']).amount.agg(['sum','count']).reset_index()
df.merge(g, on = ['name', 'category'], how = 'left')
Output:
name category amount sum count
0 john a 100 140 2
1 jim b 200 225 3
2 eric c 13 105 2
3 jim b 23 225 3
4 john a 40 140 2
5 jim b 2 225 3
6 jim c 43 43 1
7 eric c 92 105 2
8 eric a 83 83 1
9 john c 1 1 1
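If the new columns should carry the same names as in the desired output, named aggregation can rename them before the merge (a small sketch building on the answer above; the keyword names sum_for_category and count_for_category are simply taken from the question):
g = (df.groupby(['name', 'category'])['amount']
       .agg(sum_for_category='sum', count_for_category='count')
       .reset_index())
df = df.merge(g, on=['name', 'category'], how='left')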
import pandas as pd
df = pd.DataFrame({
"name":["john","jim","eric","jim","john","jim","jim","eric","eric","john"],
"category":["a","b","c","b","a","b","c","c","a","c"],
"amount":[100,200,13,23,40,2,43,92,83,1]
})
df_Count = df.groupby(['name','category']).count().reset_index().rename({'amount':'Count_For_Category'}, axis=1)
df_Sum = df.groupby(['name','category']).sum().reset_index().rename({'amount':'Sum_For_Category'}, axis=1)
df_v2 = pd.merge(df,df_Count[['name','category','Count_For_Category']], left_on=['name','category'], right_on=['name','category'], how='left')
df_v2 = pd.merge(df_v2,df_Sum[['name','category','Sum_For_Category']], left_on=['name','category'], right_on=['name','category'], how='left')
df_v2
Hi there,
The code above is kept simple so it is easy to understand; just run it and you will get what you want.
Thanks,
Leon

Altering groupby and value_counts output for mapping to dataframe

I have a scenario where I am trying to filter a dataframe by a particular value, and count how many times another identifier is present. I'm then turning that into a dictionary and mapping back to the dataframe. The issue I am having is that the resulting dictionary cannot be mapped back to the dataframe because I'm introducing complexity to the dictionary (extra keys?), and I don't know how to avoid it.
I guess the simple question is: how can I use value_counts on my CELL_ID column, filter by another column called Grid_Type, and map the results back to all cells per CELL_ID?
What I'm doing so far
This works to count how many rows there are for each CELL_ID, but does NOT allow me to filter by Grid_Type:
z = df['CELL_ID'].value_counts()
z1 = z.to_dict()
df['CELL_CNT'] = df['CELL_ID'].map(z1)
The dictionary output from this simple example looks like:
7015988: 1, 7122961: 1, 6976792: 1
My bad code
This is what I've been working on so far, where I want to be able to return the count, filtered by the Grid_Type. E.g. I want to be able to count the number of times I see "Spot" in/for each CELL_ID.
z = df[df.Grid_Type == 'Spot'].groupby('CELL_ID')['Grid_Type'].value_counts()
z1 = z.to_dict()
df['SPOT_CNT'] = df['CELL_ID'].map(z1)
It seems that in the example where I'm trying to filter, the dictionary returns a more complex result that includes the Grid_Type. The thing is, I only want the counts mapped against the CELL_ID. E.g. dictionary response:
(7133691, 'Spot'): 3, (7133692, 'Spot'): 3, (7133693, 'Spot'): 2
Example Data
+---------+-----------+
| CELL_ID | Grid_Type |
+---------+-----------+
| 001 | Spot |
| 001 | Square |
| 001 | Spot |
| 001 | Square |
| 001 | Square |
| 002 | Spot |
| 002 | Square |
| 002 | Square |
| 003 | Square |
| 003 | Spot |
| 003 | Spot |
| 003 | Spot |
+---------+-----------+
Desired Outcome
+---------+-----------+----------+
| CELL_ID | Grid_Type | SPOT_CNT |
+---------+-----------+----------+
| 001 | Spot | 2 |
| 001 | Square | 2 |
| 001 | Spot | 2 |
| 001 | Square | 2 |
| 001 | Square | 2 |
| 002 | Spot | 1 |
| 002 | Square | 1 |
| 002 | Square | 1 |
| 003 | Square | 3 |
| 003 | Spot | 3 |
| 003 | Spot | 3 |
| 003 | Spot | 3 |
+---------+-----------+----------+
Thanks for any help you might be able to offer.
df = pd.read_csv('spot.txt', sep=r"[ ]{1,}", engine='python', dtype='object')
print(df)
CELL_ID Grid_Type
0 001 Spot
1 001 Square
2 001 Spot
3 001 Square
4 001 Square
5 002 Spot
6 002 Square
7 002 Square
8 003 Square
9 003 Spot
10 003 Spot
11 003 Spot
df_gb = df['Grid_Type'].groupby([df['CELL_ID']]).value_counts()
print(df_gb)
CELL_ID Grid_Type
001 Square 3
Spot 2
002 Square 2
Spot 1
003 Spot 3
Square 1
Name: Grid_Type, dtype: int64
df_gb_dict = df_gb.to_dict()
count_list = []
for idx, row in df.iterrows():
    for k, v in df_gb_dict.items():
        # k is a (CELL_ID, Grid_Type) tuple, v is the count of that combination
        if k[0] == row['CELL_ID'] and k[1] == row['Grid_Type'] and row['Grid_Type'] == 'Spot':
            count_list.append([k[0], k[1], v])
        if k[0] == row['CELL_ID'] and k[1] == row['Grid_Type'] and row['Grid_Type'] == 'Square':
            # for 'Square' rows, look up the 'Spot' count of the same cell instead
            count_list.append([k[0], k[1], df_gb_dict[(row['CELL_ID'], 'Spot')]])
new_df = pd.DataFrame(count_list, columns=['CELL_ID', 'Grid_Type', 'SPOT_CNT'])
new_df.sort_values(by='CELL_ID', inplace=True)
new_df = new_df.reset_index(drop=True)
print(new_df)
CELL_ID Grid_Type SPOT_CNT
0 001 Spot 2
1 001 Square 2
2 001 Spot 2
3 001 Square 2
4 001 Square 2
5 002 Spot 1
6 002 Square 1
7 002 Square 1
8 003 Square 3
9 003 Spot 3
10 003 Spot 3
11 003 Spot 3
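A shorter route to the same SPOT_CNT column (a sketch of my own, not part of the answer above) is to count only the 'Spot' rows per CELL_ID and map that Series straight back, which avoids the tuple keys the question ran into:
# Count 'Spot' rows per CELL_ID, then map the counts back onto every row.
spot_counts = df.loc[df['Grid_Type'] == 'Spot', 'CELL_ID'].value_counts()
df['SPOT_CNT'] = df['CELL_ID'].map(spot_counts).fillna(0).astype(int)   # fillna(0) covers cells with no 'Spot' rows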
Seems you have an answer, but I would approach this problem with transform():
# set it up
df = pd.read_clipboard()
print(df)
CELL_ID Grid_Type
0 1 Spot
1 1 Square
2 1 Spot
3 1 Square
4 1 Square
5 2 Spot
6 2 Square
7 2 Square
8 3 Square
9 3 Spot
10 3 Spot
11 3 Spot
df['SPOT_CNT'] = df.groupby('CELL_ID')['Grid_Type'].transform(lambda x: sum(x == 'Spot'))
print(df)
CELL_ID Grid_Type SPOT_CNT
0 1 Spot 2
1 1 Square 2
2 1 Spot 2
3 1 Square 2
4 1 Square 2
5 2 Spot 1
6 2 Square 1
7 2 Square 1
8 3 Square 3
9 3 Spot 3
10 3 Spot 3
11 3 Spot 3
Inside the lambda function:
- x == 'Spot' returns a boolean Series
- for each group, sum() adds up the True values
Lastly transform, as per the docs, behaves like so:
DataFrame.transform(self, func, axis=0, *args, **kwargs) → 'DataFrame'
"Call func on self producing a DataFrame with transformed values."
"Produced DataFrame will have same axis length as self." <----
...
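To make the mechanics concrete, here is a tiny illustration (with made-up values) of what the lambda sees for a single group:
grp = pd.Series(['Spot', 'Square', 'Spot'])   # Grid_Type values of one CELL_ID group
mask = grp == 'Spot'                          # boolean Series: True, False, True
print(mask.sum())                             # 2 -- transform() broadcasts this back to every row of the group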
Hope this is helpful.

Pandas - Rolling average for a group across multiple columns; large dataframe

I have the following dataframe:
+-----+-----+-------------+-------------+-------------------------+
| ID1 | ID2 | Box1_weight | Box2_weight | Average Prev Weight ID1 |
+-----+-----+-------------+-------------+-------------------------+
| 19  | 677 | 3           | 2           | -                       |
+-----+-----+-------------+-------------+-------------------------+
| 677 | 19  | 1           | 0           | 2                       |
+-----+-----+-------------+-------------+-------------------------+
| 19  | 677 | 3           | 1           | (0 + 3)/2 = 1.5         |
+-----+-----+-------------+-------------+-------------------------+
| 19  | 677 | 7           | 0           | (3+0+3)/3 = 2           |
+-----+-----+-------------+-------------+-------------------------+
| 677 | 19  | 1           | 3           | (0+1+1)/3 = 0.6         |
+-----+-----+-------------+-------------+-------------------------+
I want to work out the moving average of the weight of the past 3 boxes, based on ID. I want to do this for all IDs in ID1.
The column I want to calculate, along with the calculations, is shown in the table above, labelled "Average Prev Weight ID1".
I can get a rolling average for each individual column using the following:
df_copy.groupby('ID1')['Box1_weight'].apply(lambda x: x.shift().rolling(period_length, min_periods=1).mean())
However, this does not take into account that the item may also have been packed in the column labelled "Box2_weight"
How can I get a rolling average that is per ID, across the two columns?
Any guidance is appreciated.
Here is my attempt:
Stack the 2 id columns and the 2 weight columns to create a dataframe with 1 id column and 1 weight column, calculate the running average, and assign the running average for ID1 back to the dataframe.
I have used your code for calculating the rolling average, but I arranged the data into df2 before doing it.
import pandas as pd

d = {
    "ID1": [19, 677, 19, 19, 677],
    "ID2": [677, 19, 677, 677, 19],
    "Box1_weight": [3, 1, 3, 7, 1],
    "Box2_weight": [2, 0, 1, 0, 3]
}
df = pd.DataFrame(d)
display(df)

period_length = 3
ids = df[["ID1", "ID2"]].stack().values
weights = df[["Box1_weight", "Box2_weight"]].stack().values
df2 = pd.DataFrame(dict(ids=ids, weights=weights))

rolling_avg = (df2.groupby("ids")["weights"]
                  .apply(lambda x: x.shift().rolling(period_length, min_periods=1).mean())
                  .values.reshape(-1, 2))

df["rolling_avg"] = rolling_avg[:, 0]
display(df)
Result
ID1 ID2 Box1_weight Box2_weight
0 19 677 3 2
1 677 19 1 0
2 19 677 3 1
3 19 677 7 0
4 677 19 1 3
ID1 ID2 Box1_weight Box2_weight rolling_avg
0 19 677 3 2 NaN
1 677 19 1 0 2.000000
2 19 677 3 1 1.500000
3 19 677 7 0 2.000000
4 677 19 1 3 0.666667
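Why column 0 of the reshape lines up with ID1 (my reading of the code above): stack() interleaves the values row by row as ID1, ID2, ID1, ID2, ..., so reshaping the per-id rolling averages back into two columns puts the ID1 averages in column 0 and the ID2 averages in column 1:
# Sketch: the same reshape viewed as a DataFrame (the column names here are mine).
print(pd.DataFrame(rolling_avg, columns=['avg_at_ID1_position', 'avg_at_ID2_position']))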
Not sure if this is what you want. I had trouble understanding your requirements. But here's a go:
ids = ['ID1', 'ID2']
ind = np.argsort(df[ids].to_numpy(), 1)
make_sort = lambda s, ind: np.take_along_axis(s, ind, axis=1)
f = make_sort(df[ids].to_numpy(), ind)
s = make_sort(df[['Box1_weight', 'Box2_weight']].to_numpy(), ind)
df2 = pd.DataFrame(np.concatenate([f,s], 1), columns=df.columns)
res1 = df2.groupby('ID1').Box1_weight.rolling(3, min_periods=1).mean().shift()
res2 = df2.groupby('ID2').Box2_weight.rolling(3, min_periods=1).mean().shift()
means = pd.concat([res1,res2], 1).rename(columns={'Box1_weight': 'w1', 'Box2_weight': 'w2'})
x = df.set_index([df.ID1.values, df.index])
final = x[ids].merge(means, left_index=True, right_index=True)[['w1','w2']].sum(1).sort_index(level=1)
df['final_weight'] = final.tolist()
ID1 ID2 Box1_weight Box2_weight final_weight
0 19 677 3 2 0.000000
1 677 19 1 0 2.000000
2 19 677 3 1 1.500000
3 19 677 7 0 2.000000
4 677 19 1 3 0.666667

Plots shifting in heatmaps in Seaborn Facetgrid

Sorry in advance for the number of images, but they help demonstrate the issue.
I have built a dataframe which contains film thickness measurements, for a number of substrates, for a number of layers, as a function of coordinates:
| | Sub | Result | Layer | Row | Col |
|----|-----|--------|-------|-----|-----|
| 0 | 1 | 2.95 | 3 - H | 0 | 72 |
| 1 | 1 | 2.97 | 3 - V | 0 | 72 |
| 2 | 1 | 0.96 | 1 - H | 0 | 72 |
| 3 | 1 | 3.03 | 3 - H | -42 | 48 |
| 4 | 1 | 3.04 | 3 - V | -42 | 48 |
| 5 | 1 | 1.06 | 1 - H | -42 | 48 |
| 6 | 1 | 3.06 | 3 - H | 42 | 48 |
| 7 | 1 | 3.09 | 3 - V | 42 | 48 |
| 8 | 1 | 1.38 | 1 - H | 42 | 48 |
| 9 | 1 | 3.05 | 3 - H | -21 | 24 |
| 10 | 1 | 3.08 | 3 - V | -21 | 24 |
| 11 | 1 | 1.07 | 1 - H | -21 | 24 |
| 12 | 1 | 3.06 | 3 - H | 21 | 24 |
| 13 | 1 | 3.09 | 3 - V | 21 | 24 |
| 14 | 1 | 1.05 | 1 - H | 21 | 24 |
| 15 | 1 | 3.01 | 3 - H | -63 | 0 |
| 16 | 1 | 3.02 | 3 - V | -63 | 0 |
and this continues for >10 subs (per batch), and 13 sites per sub, and for 3 layers - this df is a composite.
I am attempting to present the data as a facetgrid of heatmaps (adapting code from How to make heatmap square in Seaborn FacetGrid - thanks!)
I can plot a subset of the df quite happily:
spam = df.loc[df.Sub== 6].loc[df.Layer == '3 - H']
spam_p= spam.pivot(index='Row', columns='Col', values='Result')
sns.heatmap(spam_p, cmap="plasma")
BUT - there are some missing results, where the layer measurement errors out (returning '10000'), so I've replaced these with NaNs:
df['Result'] = df['Result'].replace(10000, np.nan)
To plot a facetgrid to show all subs/layers, I've written the following code:
def draw_heatmap(*args, **kwargs):
    data = kwargs.pop('data')
    d = data.pivot(columns=args[0], index=args[1], values=args[2])
    sns.heatmap(d, **kwargs)
fig = sns.FacetGrid(spam, row='Wafer',
col='Feature', height=5, aspect=1)
fig.map_dataframe(draw_heatmap, 'Col', 'Row', 'Result', cbar=False, cmap="plasma", annot=True, annot_kws={"size": 20})
which yields:
It has automatically adjusted axes to not show any positions where there is a NaN.
I have tried masking (see https://github.com/mwaskom/seaborn/issues/375) but it just errors out with "Inconsistent shape between the condition and the input (got (237, 15) and (7, 7))".
And when not using the cropped-down dataset (i.e. df instead of spam), the code generates the following FacetGrid:
Plots featuring missing values at extreme (edge) coordinate positions make the plot shift within the axes - here all apparently to the upper left. Sub #5, layer 3-H should look like:
i.e. blanks in the places where there are NaNs.
Why is the facetgrid shifting the entire plot up and/or left? The alternative is dynamically generating subplots based on a sub/layer-count (ugh!).
Any help very gratefully received.
Full dataset for 2 layers of sub 5:
Sub Result Layer Row Col
0 5 2.987 3 - H 0 72
1 5 0.001 1 - H 0 72
2 5 1.184 3 - H -42 48
3 5 1.023 1 - H -42 48
4 5 3.045 3 - H 42 48
5 5 0.282 1 - H 42 48
6 5 3.083 3 - H -21 24
7 5 0.34 1 - H -21 24
8 5 3.07 3 - H 21 24
9 5 0.41 1 - H 21 24
10 5 NaN 3 - H -63 0
11 5 NaN 1 - H -63 0
12 5 3.086 3 - H 0 0
13 5 0.309 1 - H 0 0
14 5 0.179 3 - H 63 0
15 5 0.455 1 - H 63 0
16 5 3.067 3 - H -21 -24
17 5 0.136 1 - H -21 -24
18 5 1.907 3 - H 21 -24
19 5 1.018 1 - H 21 -24
20 5 NaN 3 - H -42 -48
21 5 NaN 1 - H -42 -48
22 5 NaN 3 - H 42 -48
23 5 NaN 1 - H 42 -48
24 5 NaN 3 - H 0 -72
25 5 NaN 1 - H 0 -72
You may create a list of unique column and row labels and reindex the pivot table with them.
cols = df["Col"].unique()
rows = df["Row"].unique()
pivot = data.pivot(...).reindex(index=rows, columns=cols)
as seen in this answer.
Some complete code:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
r = np.repeat([0,-2,2,-1,1,-3],2)
row = np.concatenate((r, [0]*2, -r[::-1]))
c = np.array([72]*2+[48]*4 + [24]*4 + [0]* 3)
col = np.concatenate((c,-c[::-1]))
df = pd.DataFrame({"Result": np.random.rand(26),
                   "Layer": list("AB")*13,
                   "Row": row, "Col": col})
df1 = df.copy()
df1["Sub"] = [5]*len(df1)
df1.loc[10:11, "Result"] = np.NaN
df1.loc[20:, "Result"] = np.NaN
df2 = df.copy()
df2["Sub"] = [3]*len(df2)
df2.loc[0:2, "Result"] = np.NaN
df = pd.concat([df1,df2])
cols = np.unique(df["Col"].values)
rows = np.unique(df["Row"].values)
def draw_heatmap(*args, **kwargs):
    data = kwargs.pop('data')
    d = data.pivot(columns=args[0], index=args[1], values=args[2])
    d = d.reindex(index=rows, columns=cols)
    print(d)
    sns.heatmap(d, **kwargs)
grid = sns.FacetGrid(df, row='Sub', col='Layer', height=3.5, aspect=1)
grid.map_dataframe(draw_heatmap, 'Col', 'Row', 'Result', cbar=False,
                   cmap="plasma", annot=True)
plt.show()

How can I create a pivot_table with pandas that displays other fields than those I use for the index

I use the "pandas" package for Python, and I have a question.
I have a DataFrame like this:
| first | last | datr |city|
|Zahir |Petersen|22.11.15|9 |
|Zahir |Petersen|22.11.15|2 |
|Mason |Sellers |10.04.16|4 |
|Gannon |Cline |29.10.15|2 |
|Craig |Sampson |20.04.16|2 |
|Craig |Sampson |20.04.16|4 |
|Cameron |Mathis |09.05.15|6 |
|Adam |Hurley |16.04.16|2 |
|Brock |Vaughan |14.04.16|10 |
|Xanthus |Murray |30.03.15|6 |
|Xanthus |Murray |30.03.15|7 |
|Xanthus |Murray |30.03.15|4 |
|Palmer |Caldwell|31.10.15|2 |
I want to create a pivot_table by the fields ['first', 'last', 'datr'], but display
['first', 'last', 'datr', 'city'] where the count of records by ['first', 'last', 'datr'] is more than one, like this:
| first | last | datr |city|
|Zahir |Petersen|22.11.15|9 | 2
| | | |2 | 2
|Craig |Sampson |20.04.16|2 | 2
| | | |4 | 2
|Xanthus |Murray |30.03.15|6 | 3
| | | |7 | 3
| | | |4 | 3
UPD.
If I group by three fields out of four, then
df['count'] = df.groupby(['first','last','datr']).transform('count')
works, but if more than one column is left out of the groupby, this code throws an error. For example (all columns: 4 ('first', 'last', 'datr', 'city'); columns for groupby: 2 ('first', 'last'); 4 - 2 = 2):
In [181]: df['count'] = df.groupby(['first','last']).transform('count')
...
ValueError: Wrong number of items passed 2, placement implies 1
You can do this with groupby. Group by the three columns (first, last and datr), and then count the number of elements in each group:
In [63]: df['count'] = df.groupby(['first', 'last', 'datr']).transform('count')
In [64]: df
Out[64]:
first last datr city count
0 Zahir Petersen 22.11.15 9 2
1 Zahir Petersen 22.11.15 2 2
2 Mason Sellers 10.04.16 4 1
3 Gannon Cline 29.10.15 2 1
4 Craig Sampson 20.04.16 2 2
5 Craig Sampson 20.04.16 4 2
6 Cameron Mathis 09.05.15 6 1
7 Adam Hurley 16.04.16 2 1
8 Brock Vaughan 14.04.16 10 1
9 Xanthus Murray 30.03.15 6 3
10 Xanthus Murray 30.03.15 7 3
11 Xanthus Murray 30.03.15 4 3
12 Palmer Caldwell 31.10.15 2 1
From there, you can filter the frame:
In [65]: df[df['count'] > 1]
Out[65]:
first last datr city count
0 Zahir Petersen 22.11.15 9 2
1 Zahir Petersen 22.11.15 2 2
4 Craig Sampson 20.04.16 2 2
5 Craig Sampson 20.04.16 4 2
9 Xanthus Murray 30.03.15 6 3
10 Xanthus Murray 30.03.15 7 3
11 Xanthus Murray 30.03.15 4 3
And if you want these columns as the index (as in the example output in your question): df.set_index(['first', 'last', 'datr'])
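Putting both steps together (a minimal sketch combining the filter and the index, assuming the 'count' column from above has already been added):
out = df[df['count'] > 1].set_index(['first', 'last', 'datr'])
print(out)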
