I have a somewhat involved transformation of my data where I was wondering if someone had a more efficient method than mine. I start with a dataframe as this one:
| | item | value |
|---:|:------------------|--------:|
| 0 | WAUZZZF23MN053792 | 0 |
| 1 | A | 1 |
| 2 | WF0TK3SS2MMA50940 | 0 |
| 3 | A | 10 |
| 4 | B | 11 |
| 5 | C | 12 |
| 6 | D | 13 |
| 7 | E | 14 |
| 8 | W0VEAZKXZMJ857138 | 0 |
| 9 | A | 20 |
| 10 | B | 21 |
| 11 | C | 22 |
| 12 | D | 23 |
| 13 | E | 24 |
| 14 | W0VEAZKXZMJ837930 | 0 |
| 15 | A | 30 |
| 16 | B | 31 |
| 17 | C | 32 |
| 18 | D | 33 |
| 19 | E | 34 |
and I would like to arrive here:
| | item | value | C |
|---:|:------------------|--------:|----:|
| 0 | WAUZZZF23MN053792 | 0 | nan |
| 1 | WF0TK3SS2MMA50940 | 0 | 12 |
| 2 | W0VEAZKXZMJ857138 | 0 | 22 |
| 3 | W0VEAZKXZMJ837930 | 0 | 32 |
i.e. for every "long" entry, check if there is an item "C" following, and if so, copy that line's value to the line with the long item.
The ugly way I have done this is the following:
import re
import pandas as pd
df = pd.DataFrame(
{
"item": [
"WAUZZZF23MN053792",
"A",
"WF0TK3SS2MMA50940",
"A",
"B",
"C",
"D",
"E",
"W0VEAZKXZMJ857138",
"A",
"B",
"C",
"D",
"E",
"W0VEAZKXZMJ837930",
"A",
"B",
"C",
"D",
"E",
],
"value": [
0,
1,
0,
10,
11,
12,
13,
14,
0,
20,
21,
22,
23,
24,
0,
30,
31,
32,
33,
34,
],
}
)
def isVIN(x):
return (len(x) == 17) & (x.upper() == x) & (re.search(r"\s|O|I", x) is None)
# filter the lines with item=="C" or a VIN in item
x = pd.concat([df, df["item"].rename("group").apply(isVIN).cumsum()], axis=1).loc[
lambda x: (x["item"] == "C") | (x["item"].apply(isVIN))
]
# pivot the lines where item="C"
y = x.loc[x["item"] == "C"].pivot(columns="item").droplevel(level=1, axis=1)
# and then merge the two:
print(
x.loc[x["item"].apply(isVIN)]
.merge(y, on="group", how="left")
.drop("group", axis=1)
.rename(columns={"value_y": "C", "value_x": "value"})
.to_markdown()
)
Does anyone have an idea how to make this a bit less ugly?
Subjectively less ugly
mask = df.item.str.len().eq(17)
df.set_index(
[df.item.where(mask).ffill(), 'item']
)[~mask.to_numpy()].value.unstack()['C'].reset_index()
item C
0 W0VEAZKXZMJ837930 32.0
1 W0VEAZKXZMJ857138 22.0
2 WAUZZZF23MN053792 NaN
3 WF0TK3SS2MMA50940 12.0
A bit more involved but better
import numpy as np

mask = df.item.str.len().eq(17)
item = df.item.where(mask).ffill()
subs = df.item.mask(mask)
valu = df.value
i, r = pd.factorize(item)
j, c = pd.factorize(subs)
a = np.zeros((len(r), len(c)), valu.dtype)
a[i, j] = valu
pd.DataFrame(a, r, c)[['C']].rename_axis('item').reset_index()
item C
0 WAUZZZF23MN053792 0
1 WF0TK3SS2MMA50940 12
2 W0VEAZKXZMJ857138 22
3 W0VEAZKXZMJ837930 32
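One caveat worth flagging (my addition, not part of the answer above): because a is zero-initialised, a VIN group with no "C" row shows 0 rather than NaN, as in the first row of this output. If that distinction matters, a NaN-filled float array keeps it; a minimal sketch reusing the variables defined above:
import numpy as np
import pandas as pd

# NaN-filled float array: (item, letter) pairs that never occur stay NaN
a = np.full((len(r), len(c)), np.nan)
a[i, j] = valu
pd.DataFrame(a, r, c)[['C']].rename_axis('item').reset_index()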
Try:
# Your conditions vectorized
m = ((df['item'].str.len() == 17)
& (df['item'].str.upper() == df['item'])
& (~df['item'].str.contains(r'\s|O|I')))
# Create virtual groups to align rows
df['grp'] = m.cumsum()
# Merge and align rows
out = (pd.concat([df[m].set_index('grp'),
df[~m].pivot(index='grp', columns='item', values='value')], axis=1)
.reset_index(drop=True))
Output:
>>> out
item value A B C D E
0 WAUZZZF23MN053792 0 1.0 NaN NaN NaN NaN
1 WF0TK3SS2MMA50940 0 10.0 11.0 12.0 13.0 14.0
2 W0VEAZKXZMJ857138 0 20.0 21.0 22.0 23.0 24.0
3 W0VEAZKXZMJ837930 0 30.0 31.0 32.0 33.0 34.0
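If only the C column is needed (my addition), a final column selection gives the shape asked for in the question:
out = out[['item', 'value', 'C']]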
The other answers are all very nice. For a bit more variety, you could also filter df for "long" data and C values; concat; then "compress" the DataFrame using groupby + first:
out = pd.concat([df[df['item'].str.len()==17],
df.loc[df['item']=='C', ['value']].set_axis(['C'], axis=1)], axis=1)
out = out.groupby(out['item'].str.len().eq(17).cumsum()).first().reset_index(drop=True)
Output:
item value C
0 WAUZZZF23MN053792 0.0 NaN
1 WF0TK3SS2MMA50940 0.0 12.0
2 W0VEAZKXZMJ857138 0.0 22.0
3 W0VEAZKXZMJ837930 0.0 32.0
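A small note (my addition): value and C come back as floats because the concat introduces NaN. If integer dtypes matter, a nullable-integer cast restores them, for example:
out[['value', 'C']] = out[['value', 'C']].astype('Int64')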
How about this with datar, a pandas wrapper that reimagines pandas APIs:
Construct data
>>> import re
>>> from datar.all import (
... c, f, LETTERS, tibble, first, cumsum,
... mutate, group_by, slice, pivot_wider, select
... )
>>>
>>> df = tibble(
... item=c(
... "WAUZZZF23MN053792",
... "A",
... "WF0TK3SS2MMA50940",
... LETTERS[:5],
... "W0VEAZKXZMJ857138",
... LETTERS[:5],
... "W0VEAZKXZMJ837930",
... LETTERS[:5],
... ),
... value=c(
... 0, 1,
... 0, f[10:15],
... 0, f[20:25],
... 0, f[30:35],
... )
... )
>>> df
item value
<object> <int64>
0 WAUZZZF23MN053792 0
1 A 1
2 WF0TK3SS2MMA50940 0
3 A 10
4 B 11
5 C 12
6 D 13
7 E 14
8 W0VEAZKXZMJ857138 0
9 A 20
10 B 21
11 C 22
12 D 23
13 E 24
14 W0VEAZKXZMJ837930 0
15 A 30
16 B 31
17 C 32
18 D 33
19 E 34
Manipulate data
>>> def isVIN(x):
... return len(x) == 17 and x.isupper() and re.search(r"\s|O|I", x) is None
...
>>> (
... df
... # Mark the VIN groups
... >> mutate(is_vin=cumsum(f.item.transform(isVIN)))
... # Group by VINs
... >> group_by(f.is_vin)
... # Put the VINs and their values in new columns
... >> mutate(vin=first(f.item), vin_value=first(f.value))
... # Exclude VINs in the items
... >> slice(~c(0))
... # Get the values of A, B, C ...
... >> pivot_wider([f.vin, f.vin_value], names_from=f.item, values_from=f.value)
... # Select and rename columns
... >> select(item=f.vin, value=f.vin_value, C=f.C)
... )
item value C
<object> <int64> <float64>
0 W0VEAZKXZMJ837930 0 32.0
1 W0VEAZKXZMJ857138 0 22.0
2 WAUZZZF23MN053792 0 NaN
3 WF0TK3SS2MMA50940 0 12.0
Given the following df:
SequenceNumber | ID | CountNumber | Side | featureA | featureB
0 0 | 0 | 3 | Sell | 4 | 2
1 0 | 1 | 1 | Buy | 12 | 45
2 0 | 2 | 1 | Buy | 1 | 4
3 0 | 3 | 1 | Buy | 3 | 36
4 1 | 0 | 1 | Sell | 5 | 11
5 1 | 1 | 1 | Sell | 7 | 12
6 1 | 2 | 2 | Buy | 5 | 35
I want to create a new df such that, for every SequenceNumber value, it takes the rows with CountNumber == 1 and creates new rows: if Side == 'Buy', that row's ID goes in a column named To; otherwise it goes in a column named From. The column left empty (From or To) then takes the ID of the row with CountNumber > 1 (there is only one such row per SequenceNumber value). The rest of the features should be preserved.
NOTE: basically each SequenceNumber represents one transaction that has either one seller and multiple buyers, or vice versa. I am trying to create a database that links the buyers and sellers, where From is the seller ID and To is the buyer ID.
The output should look like this:
SequenceNumber | From | To | featureA | featureB
0 0 | 0 | 1 | 12 | 45
1 0 | 0 | 2 | 1 | 4
2 0 | 0 | 3 | 3 | 36
3 1 | 0 | 2 | 5 | 11
4 1 | 1 | 2 | 7 | 12
I implemented a method that does this, however I am using for loops, which take a long time to run on large data. I am looking for a faster, scalable method. Any suggestions?
Here is the original df:
df = pd.DataFrame({'SequenceNumber': [0, 0, 0, 0, 1, 1, 1],
'ID': [0, 1, 2, 3, 0, 1, 2],
'CountNumber': [3, 1, 1, 1, 1, 1, 2],
'Side': ['Sell', 'Buy', 'Buy', 'Buy', 'Sell', 'Sell', 'Buy'],
'featureA': [4, 12, 1, 3, 5, 7, 5],
'featureB': [2, 45, 4, 36, 11, 12, 35]})
You can reshape with a pivot, select the features to keep with a mask and rework the output with groupby.first then concat:
features = list(df.filter(like='feature'))
out = (
# repeat the rows with CountNumber > 1
df.loc[df.index.repeat(df['CountNumber'])]
# rename Sell/Buy into from/to and de-duplicate the rows per group
.assign(Side=lambda d: d['Side'].map({'Sell': 'from', 'Buy': 'to'}),
n=lambda d: d.groupby(['SequenceNumber', 'Side']).cumcount()
)
# mask the features where CountNumber > 1
.assign(**{f: lambda d, f=f: d[f].mask(d['CountNumber'].gt(1)) for f in features})
.drop(columns='CountNumber')
# reshape with a pivot
.pivot(index=['SequenceNumber', 'n'], columns='Side')
)
out = (
pd.concat([out['ID'], out.drop(columns='ID').groupby(level=0, axis=1).first()], axis=1)
.reset_index('SequenceNumber')
)
Output:
SequenceNumber from to featureA featureB
n
0 0 0 1 12.0 45.0
1 0 0 2 1.0 4.0
2 0 0 3 3.0 36.0
0 1 0 2 5.0 11.0
1 1 1 2 7.0 12.0
Alternative using a merge, as suggested by ifly6:
features = list(df.filter(like='feature'))
df1 = df.query('Side=="Sell"').copy()
df1[features] = df1[features].mask(df1['CountNumber'].gt(1))
df2 = df.query('Side=="Buy"').copy()
df2[features] = df2[features].mask(df2['CountNumber'].gt(1))
out = (df1.merge(df2, on='SequenceNumber').rename(columns={'ID_x': 'from', 'ID_y': 'to'})
.set_index(['SequenceNumber', 'from', 'to'])
.filter(like='feature')
.pipe(lambda d: d.groupby(d.columns.str.replace('_.*?$', '', regex=True), axis=1).first())
.reset_index()
)
Output:
SequenceNumber from to featureA featureB
0 0 0 1 12.0 45.0
1 0 0 2 1.0 4.0
2 0 0 3 3.0 36.0
3 1 0 2 5.0 11.0
4 1 1 2 7.0 12.0
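As a side note (my addition): the axis=1 groupby in the last step is deprecated in recent pandas versions. The masked _x/_y feature columns can also be collapsed with combine_first; a rough sketch reusing df1, df2 and features from above:
merged = (df1.merge(df2, on='SequenceNumber')
             .rename(columns={'ID_x': 'from', 'ID_y': 'to'}))
for f in features:
    # keep the sell-side value where present, otherwise the buy-side value
    merged[f] = merged[f + '_x'].combine_first(merged[f + '_y'])
out = merged[['SequenceNumber', 'from', 'to'] + features]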
Initial response. To get about halfway to the answer, split the data into sellers and buyers, then merge it against itself on the sequence number:
ndf = df.query('Side == "Sell"').merge(
df.query('Side == "Buy"'), on='SequenceNumber', suffixes=['_sell', '_buy']) \
.rename(columns={'ID_sell': 'From', 'ID_buy': 'To'})
I then drop the side variable.
ndf = ndf.drop(columns=[i for i in ndf.columns if i.startswith('Side')])
This creates a very wide table:
SequenceNumber From CountNumber_sell featureA_sell featureB_sell To CountNumber_buy featureA_buy featureB_buy
0 0 0 3 4 2 1 1 12 45
1 0 0 3 4 2 2 1 1 4
2 0 0 3 4 2 3 1 3 36
3 1 0 1 5 11 2 2 5 35
4 1 1 1 7 12 2 2 5 35
This leaves you, however, with two featureA and featureB columns. I don't think your question clearly establishes which one takes precedence. Please provide more information on that.
Is it the side with the lower CountNumber? Is it the side where CountNumber == 1? If the latter, then just null out the entries at the merge stage, do the merge, and then forward fill your appropriate columns to recover the proper values.
Re nulling: if you null the portions of featureA and featureB where CountNumber is not 1, you can then create new versions of those columns after the merge by forward filling and selecting.
s = df.query('Side == "Sell"').copy()
s.loc[s['CountNumber'] != 1, ['featureA', 'featureB']] = np.nan
b = df.query('Side == "Buy"').copy()
b.loc[b['CountNumber'] != 1, ['featureA', 'featureB']] = np.nan
ndf = s.merge(
b, on='SequenceNumber', suffixes=['_sell', '_buy']) \
.rename(columns={'ID_sell': 'From', 'ID_buy': 'To'})
ndf['featureA'] = ndf[['featureA_buy', 'featureA_sell']] \
.ffill(axis=1).iloc[:, -1]
ndf['featureB'] = ndf[['featureB_buy', 'featureB_sell']] \
.ffill(axis=1).iloc[:, -1]
ndf = ndf.drop(
columns=[i for i in ndf.columns if i.startswith('Side')
or i.endswith('_sell') or i.endswith('_buy')])
The final version of ndf then is:
SequenceNumber From To featureA featureB
0 0 0 1 12.0 45.0
1 0 0 2 1.0 4.0
2 0 0 3 3.0 36.0
3 1 0 2 5.0 11.0
4 1 1 2 7.0 12.0
Here is an alternative approach
df1 = df.loc[df['CountNumber'] == 1].copy()
df1['From'] = (df1['ID'].where(df1['Side'] == 'Sell', df1['SequenceNumber']
.map(df.loc[df['CountNumber'] > 1].set_index('SequenceNumber')['ID']))
)
df1['To'] = (df1['ID'].where(df1['Side'] == 'Buy', df1['SequenceNumber']
.map(df.loc[df['CountNumber'] > 1].set_index('SequenceNumber')['ID']))
)
df1 = df1.drop(['ID', 'CountNumber', 'Side'], axis=1)
df1 = df1[['SequenceNumber', 'From', 'To', 'featureA', 'featureB']]
df1.reset_index(drop=True, inplace=True)
print(df1)
SequenceNumber From To featureA featureB
0 0 0 1 12 45
1 0 0 2 1 4
2 0 0 3 3 36
3 1 0 2 5 11
4 1 1 2 7 12
I am trying to run a simple calculation over the values of each row within a group inside a dataframe, but I'm having trouble with the syntax. I think I'm specifically getting confused about what data object I should return, i.e. DataFrame vs. Series, etc.
For context, I have a bunch of stock values for each product I am tracking and I want to estimate the number of sales via a custom function which essentially does the following:
# Because stock can go up and down, I'm looking to record the difference
# when the stock is less than the previous stock number from the previous row.
# How do I access each row of the dataframe and then return the series I need?
def get_stock_sold(x):
# Written in pseudo
stock_sold = previous_stock_no - current_stock_no if current_stock_no < previous_stock_no else 0
return pd.Series(stock_sold)
I then have the following dataframe:
# 'Order' is a date in the real dataset.
data = {
'id' : ['1', '1', '1', '2', '2', '2'],
'order' : [1, 2, 3, 1, 2, 3],
'current_stock' : [100, 150, 90, 50, 48, 30]
}
df = pd.DataFrame(data)
df = df.sort_values(by=['id', 'order'])
df['previous_stock'] = df.groupby('id')['current_stock'].shift(1)
I'd like to create a new column (stock_sold) and apply the logic from above to each row within the grouped dataframe object:
df['stock_sold'] = df.groupby('id').apply(get_stock_sold)
Desired output would look as follows:
| id | order | current_stock | previous_stock | stock_sold |
|----|-------|---------------|----------------|------------|
| 1 | 1 | 100 | NaN | 0 |
| | 2 | 150 | 100.0 | 0 |
| | 3 | 90 | 150.0 | 60 |
| 2 | 1 | 50 | NaN | 0 |
| | 2 | 48 | 50.0 | 2 |
| | 3 | 30 | 48 | 18 |
Try:
df["previous_stock"] = df.groupby("id")["current_stock"].shift()
df["stock_sold"] = np.where(
df["current_stock"] > df["previous_stock"].fillna(0),
0,
df["previous_stock"] - df["current_stock"],
)
print(df)
Prints:
id order current_stock previous_stock stock_sold
0 1 1 100 NaN 0.0
1 1 2 150 100.0 0.0
2 1 3 90 150.0 60.0
3 2 1 50 NaN 0.0
4 2 2 48 50.0 2.0
5 2 3 30 48.0 18.0
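For what it's worth (my addition), the same rule can also be written without np.where: the amount sold is just the positive part of the drop, so clip does the job.
df['stock_sold'] = (df['previous_stock'] - df['current_stock']).clip(lower=0).fillna(0)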
I am using Python and Pandas for data analysis. I have sparsely distributed data in different columns, like the following:
| id | col1a | col1b | col2a | col2b | col3a | col3b |
|----|-------|-------|-------|-------|-------|-------|
| 1 | 11 | 12 | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | 21 | 86 | NaN | NaN |
| 3 | 22 | 87 | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | 545 | 32 |
I want to combine this sparsely distributed data from different columns into tightly packed columns, like the following:
| id | group | cola | colb |
|----|-------|-------|-------|
| 1 | g1 | 11 | 12 |
| 2 | g2 | 21 | 86 |
| 3 | g1 | 22 | 87 |
| 4 | g3 | 545 | 32 |
What I have tried is the following, but I am not able to do it properly:
df['cola']=np.nan
df['colb']=np.nan
df['cola'].fillna(df.col1a,inplace=True)
df['colb'].fillna(df.col1b,inplace=True)
df['cola'].fillna(df.col2a,inplace=True)
df['colb'].fillna(df.col2b,inplace=True)
df['cola'].fillna(df.col3a,inplace=True)
df['colb'].fillna(df.col3b,inplace=True)
But I think there must be a more concise and efficient way of doing this. How can I do this in a better way?
You can use df.stack(), assuming 'id' is your index (else set 'id' as the index first). Then use pivot_table:
df = df.stack().reset_index(name='val',level=1)
df['group'] = 'g' + df['level_1'].str.extract(r'col(\d+)', expand=False)
df['level_1'] = df['level_1'].str.replace(r'col\d+', '', regex=True)
df.pivot_table(index=['id','group'],columns='level_1',values='val')
level_1 cola colb
id group
1 g1 11.0 12.0
2 g2 21.0 86.0
3 g1 22.0 87.0
4 g3 545.0 32.0
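To get the flat layout shown in the question (my addition), assign the pivot result and reset the index; a sketch:
out = (df.pivot_table(index=['id', 'group'], columns='level_1', values='val')
         .rename_axis(None, axis=1)
         .reset_index())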
Another alternative with pd.wide_to_long
m = pd.wide_to_long(df,['col'],'id','j',suffix=r'\d+\w+').reset_index()
(m.join(pd.DataFrame(m.pop('j').agg(list).tolist()))
.assign(group=lambda x:x[0].radd('g'))
.set_index(['id','group',1])['col'].unstack().dropna()
.rename_axis(None,axis=1).add_prefix('col').reset_index())
id group cola colb
0 1 g1 11 12
1 2 g2 21 86
2 3 g1 22 87
3 4 g3 545 32
Use:
import re
def fx(s):
s = s.dropna()
group = 'g' + re.search(r'\d+', s.index[0])[0]
return pd.Series([group] + s.tolist(), index=['group', 'cola', 'colb'])
df1 = df.set_index('id').agg(fx, axis=1).reset_index()
# print(df1)
id group cola colb
0 1 g1 11.0 12.0
1 2 g2 21.0 86.0
2 3 g1 22.0 87.0
3 4 g3 545.0 32.0
This would be a way of doing it:
df = pd.DataFrame({'id':[1,2,3,4],
'col1a':[11,np.nan,22,np.nan],
'col1b':[12,np.nan,87,np.nan],
'col2a':[np.nan,21,np.nan,np.nan],
'col2b':[np.nan,86,np.nan,np.nan],
'col3a':[np.nan,np.nan,np.nan,545],
'col3b':[np.nan,np.nan,np.nan,32]})
df_new = df.copy(deep=False)
df_new['group'] = 'g'+df_new['id'].astype(str)
df_new['cola'] = df_new[[x for x in df_new.columns if x.endswith('a')]].sum(axis=1)
df_new['colb'] = df_new[[x for x in df_new.columns if x.endswith('b')]].sum(axis=1)
df_new = df_new[['id','group','cola','colb']]
print(df_new)
Output:
id group cola colb
0 1 g1 11.0 12.0
1 2 g2 21.0 86.0
2 3 g3 22.0 87.0
3 4 g4 545.0 32.0
So if you have more suffixes (colc, cold, cole, colf, etc...) you can create a loop and then use:
suffixes = ['a','b','c','d','e','f']
cols = ['id','group'] + ['col'+x for x in suffixes]
for i in suffixes:
df_new['col'+i] = df_new[[x for x in df_new.columns if x.endswith(i)]].sum(axis=1)
df_new = df_new[cols]
Thanks to @CeliusStingher for providing the code for the dataframe.
One suggestion is to set the id as the index and rearrange the columns, with the numbers extracted from the column names. Create a MultiIndex, and stack to get the final result:
#set id as index
df = df.set_index("id")
#pull out the numbers from each column
#so that you have (cola,1), (colb,1) ...
#add g to the numbers ... (cola, g1),(colb,g1), ...
#create a MultiIndex
#and reassign to the columns
df.columns = pd.MultiIndex.from_tuples([("".join((first,last)), f"g{second}")
for first, second, last
in df.columns.str.split(r"(\d)")],
names=[None,"group"])
#stack the data
#to get your result
df.stack()
cola colb
id group
1 g1 11.0 12.0
2 g2 21.0 86.0
3 g1 22.0 87.0
4 g3 545.0 32.0
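To recover the flat frame from the question (my addition), a final reset_index on the stacked result is enough:
result = df.stack().reset_index()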
Sorry in advance for the number of images, but they help demonstrate the issue.
I have built a dataframe which contains film thickness measurements, for a number of substrates, for a number of layers, as function of coordinates:
| | Sub | Result | Layer | Row | Col |
|----|-----|--------|-------|-----|-----|
| 0 | 1 | 2.95 | 3 - H | 0 | 72 |
| 1 | 1 | 2.97 | 3 - V | 0 | 72 |
| 2 | 1 | 0.96 | 1 - H | 0 | 72 |
| 3 | 1 | 3.03 | 3 - H | -42 | 48 |
| 4 | 1 | 3.04 | 3 - V | -42 | 48 |
| 5 | 1 | 1.06 | 1 - H | -42 | 48 |
| 6 | 1 | 3.06 | 3 - H | 42 | 48 |
| 7 | 1 | 3.09 | 3 - V | 42 | 48 |
| 8 | 1 | 1.38 | 1 - H | 42 | 48 |
| 9 | 1 | 3.05 | 3 - H | -21 | 24 |
| 10 | 1 | 3.08 | 3 - V | -21 | 24 |
| 11 | 1 | 1.07 | 1 - H | -21 | 24 |
| 12 | 1 | 3.06 | 3 - H | 21 | 24 |
| 13 | 1 | 3.09 | 3 - V | 21 | 24 |
| 14 | 1 | 1.05 | 1 - H | 21 | 24 |
| 15 | 1 | 3.01 | 3 - H | -63 | 0 |
| 16 | 1 | 3.02 | 3 - V | -63 | 0 |
and this continues for >10 subs (per batch), and 13 sites per sub, and for 3 layers - this df is a composite.
I am attempting to present the data as a facetgrid of heatmaps (adapting code from How to make heatmap square in Seaborn FacetGrid - thanks!)
I can plot a subset of the df quite happily:
spam = df.loc[df.Sub== 6].loc[df.Layer == '3 - H']
spam_p= spam.pivot(index='Row', columns='Col', values='Result')
sns.heatmap(spam_p, cmap="plasma")
BUT - there are some missing results, where the layer measurement errors out (returning '10000'), so I've replaced these with NaNs:
df['Result'] = df['Result'].replace(10000, np.nan)
To plot a facetgrid to show all subs/layers, I've written the following code:
def draw_heatmap(*args, **kwargs):
data = kwargs.pop('data')
d = data.pivot(columns=args[0], index=args[1],
values=args[2])
sns.heatmap(d, **kwargs)
fig = sns.FacetGrid(spam, row='Wafer',
col='Feature', height=5, aspect=1)
fig.map_dataframe(draw_heatmap, 'Col', 'Row', 'Result', cbar=False, cmap="plasma", annot=True, annot_kws={"size": 20})
which yields:
It has automatically adjusted axes to not show any positions where there is a NaN.
I have tried masking (see https://github.com/mwaskom/seaborn/issues/375) but it just errors out with Inconsistent shape between the condition and the input (got (237, 15) and (7, 7)).
And the result, when not using the cropped-down dataset (i.e. df instead of spam), is that the code generates the following FacetGrid:
Plots featuring missing values at extreme (edge) coordinate positions make the plot shift within the axes - here all apparently to the upper left. Sub #5, layer 3-H should look like:
i.e. blanks in the places where there are NaNs.
Why is the facetgrid shifting the entire plot up and/or left? The alternative is dynamically generating subplots based on a sub/layer-count (ugh!).
Any help very gratefully received.
Full dataset for 2 layers of sub 5:
Sub Result Layer Row Col
0 5 2.987 3 - H 0 72
1 5 0.001 1 - H 0 72
2 5 1.184 3 - H -42 48
3 5 1.023 1 - H -42 48
4 5 3.045 3 - H 42 48
5 5 0.282 1 - H 42 48
6 5 3.083 3 - H -21 24
7 5 0.34 1 - H -21 24
8 5 3.07 3 - H 21 24
9 5 0.41 1 - H 21 24
10 5 NaN 3 - H -63 0
11 5 NaN 1 - H -63 0
12 5 3.086 3 - H 0 0
13 5 0.309 1 - H 0 0
14 5 0.179 3 - H 63 0
15 5 0.455 1 - H 63 0
16 5 3.067 3 - H -21 -24
17 5 0.136 1 - H -21 -24
18 5 1.907 3 - H 21 -24
19 5 1.018 1 - H 21 -24
20 5 NaN 3 - H -42 -48
21 5 NaN 1 - H -42 -48
22 5 NaN 3 - H 42 -48
23 5 NaN 1 - H 42 -48
24 5 NaN 3 - H 0 -72
25 5 NaN 1 - H 0 -72
You may create a list of unique column and row labels and reindex the pivot table with them.
cols = df["Col"].unique()
rows = df["Row"].unique()
pivot = data.pivot(...).reindex(index=rows, columns=cols)
as seen in this answer.
Some complete code:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
r = np.repeat([0,-2,2,-1,1,-3],2)
row = np.concatenate((r, [0]*2, -r[::-1]))
c = np.array([72]*2+[48]*4 + [24]*4 + [0]* 3)
col = np.concatenate((c,-c[::-1]))
df = pd.DataFrame({"Result" : np.random.rand(26),
"Layer" : list("AB")*13,
"Row" : row, "Col" : col})
df1 = df.copy()
df1["Sub"] = [5]*len(df1)
df1.loc[10:11,"Result"] = np.nan
df1.loc[20:,"Result"] = np.nan
df2 = df.copy()
df2["Sub"] = [3]*len(df2)
df2.loc[0:2,"Result"] = np.nan
df = pd.concat([df1,df2])
cols = np.unique(df["Col"].values)
rows = np.unique(df["Row"].values)
def draw_heatmap(*args, **kwargs):
data = kwargs.pop('data')
d = data.pivot(columns=args[0], index=args[1],
values=args[2])
d = d.reindex(index=rows, columns=cols)
print(d)
sns.heatmap(d, **kwargs)
grid = sns.FacetGrid(df, row='Sub', col='Layer', height=3.5, aspect=1 )
grid.map_dataframe(draw_heatmap, 'Col', 'Row', 'Result', cbar=False,
cmap="plasma", annot=True)
plt.show()
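One extra tweak that often helps here (my addition, not from the original answer): because each facet calls sns.heatmap separately, every subplot gets its own colour scale. Passing shared vmin/vmax through map_dataframe keeps the colours comparable across subs and layers, for example:
grid.map_dataframe(draw_heatmap, 'Col', 'Row', 'Result', cbar=False, cmap="plasma",
                   annot=True, vmin=df['Result'].min(), vmax=df['Result'].max())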
I have a pandas dataframe that contains something like
+------+--------+-----+-------+
| Team | Gender | Age | Name |
+------+--------+-----+-------+
| A | M | 22 | Sam |
| A | F | 25 | Annie |
| B | M | 33 | Fred |
| B | M | 18 | James |
| A | M | 56 | Alan |
| B | F | 28 | Julie |
| A | M | 33 | Greg |
+------+--------+-----+-------+
What I'm trying to do is first group by Team and Gender (which I have been able to do using df.groupby(['Team'], as_index=False)).
Is there a way to sort the members of each group by age and add extra columns indicating how many members of each gender are above any particular member and how many are below?
eg:
For group 'Team A':
+------+--------+-----+-------+---------+---------+---------+---------+
| Team | Gender | Age | Name | M_Above | M_Below | F_Above | F_Below |
+------+--------+-----+-------+---------+---------+---------+---------+
| A | M | 22 | Sam | 0 | 2 | 0 | 1 |
| A | F | 25 | Annie | 1 | 2 | 0 | 0 |
| A | M | 33 | Greg | 1 | 1 | 1 | 0 |
| A | M | 56 | Alan | 2 | 0 | 1 | 0 |
+------+--------+-----+-------+---------+---------+---------+---------+
import pandas as pd
df = pd.DataFrame({'Team':['A','A','B','B','A','B','A'], 'Gender':['M','F','M','M','M','F','M'],
'Age':[22,25,33,18,56,28,33], 'Name':['Sam','Annie','Fred','James','Alan','Julie','Greg']}).sort_values(['Team','Age'])
for idx, data in df.groupby(['Team'], as_index=False):
m_tot = data['Gender'].value_counts()['M'] # number of males in current team
f_tot = data['Gender'].value_counts()['F'] # ditto (females)
m_seen = 0 # males seen so far for current team
f_seen = 0 # dido^ (females)
for row in data.iterrows():
(M_Above, M_below, F_Above, F_Below) = (m_seen, m_tot-m_seen, f_seen, f_tot-f_seen)
if row[1].Gender == 'M':
m_seen += 1
M_below -= 1
else:
f_seen += 1
F_Below -= 1
df.loc[row[0],'M_Above'] = M_Above
df.loc[row[0],'M_Below'] = M_below
df.loc[row[0],'F_Above'] = F_Above
df.loc[row[0],'F_Below'] = F_Below
And it results as:
Age Gender Team M_Above M_below F_Above F_Below
0 22 M A 0.0 2.0 0.0 1.0
1 25 F A 1.0 2.0 0.0 0.0
6 33 M A 1.0 1.0 1.0 0.0
4 56 M A 2.0 0.0 1.0 0.0
3 18 M B 0.0 1.0 0.0 1.0
5 28 F B 1.0 1.0 0.0 0.0
2 33 M B 1.0 0.0 1.0 0.0
And if you wish to get the new columns as int (as in your example), use:
for new_col in ['M_Above', 'M_Below', 'F_Above', 'F_Below']:
df[new_col] = df[new_col].astype(int)
Which results:
Age Gender Name Team M_Above M_Below F_Above F_Below
0 22 M Sam A 0 2 0 1
1 25 F Annie A 1 2 0 0
6 33 M Greg A 1 1 1 0
4 56 M Alan A 2 0 1 0
3 18 M James B 0 1 0 1
5 28 F Julie B 1 1 0 0
2 33 M Fred B 1 0 1 0
EDIT: (running times comparison)
Note that this solution is faster than the row-wise loc-based one (the approved solution). Averaged over 1000 iterations, it runs ~6 times faster (which would probably matter for bigger DataFrames). Run this to check:
import pandas as pd
from time import time
import numpy as np
def f(x):
for i,d in x.iterrows():
above = x.loc[:i, 'Gender'].drop(i).value_counts().reindex(['M','F'])
below = x.loc[i:, 'Gender'].drop(i).value_counts().reindex(['M','F'])
x.loc[i,'M_Above'] = above['M']
x.loc[i,'M_Below'] = below['M']
x.loc[i,'F_Above'] = above['F']
x.loc[i,'F_Below'] = below['F']
return x
df = pd.DataFrame({'Team':['A','A','B','B','A','B','A'], 'Gender':['M','F','M','M','M','F','M'],
'Age':[22,25,33,18,56,28,33], 'Name':['Sam','Annie','Fred','James','Alan','Julie','Greg']}).sort_values(['Team','Age'])
times = []
times2 = []
for i in range(1000):
tic = time()
for idx, data in df.groupby(['Team'], as_index=False):
m_tot = data['Gender'].value_counts()['M'] # number of males in current team
f_tot = data['Gender'].value_counts()['F'] # ditto (females)
m_seen = 0 # males seen so far for current team
f_seen = 0 # dido^ (females)
for row in data.iterrows():
(M_Above, M_below, F_Above, F_Below) = (m_seen, m_tot-m_seen, f_seen, f_tot-f_seen)
if row[1].Gender == 'M':
m_seen += 1
M_below -= 1
else:
f_seen += 1
F_Below -= 1
df.loc[row[0],'M_Above'] = M_Above
df.loc[row[0],'M_Below'] = M_below
df.loc[row[0],'F_Above'] = F_Above
df.loc[row[0],'F_Below'] = F_Below
toc = time()
times.append(toc-tic)
for i in range(1000):
tic = time()
df1 = df.groupby('Team', sort=False).apply(f).fillna(0)
df1.loc[:,'M_Above':] = df1.loc[:,'M_Above':].astype(int)
toc = time()
times2.append(toc-tic)
print(np.mean(times))
print(np.mean(times2))
Results:
0.0163134906292 # alternative solution
0.0622982912064 # approved solution
You can apply a custom function f with groupby on column Team.
In function f, for each row, first slice the rows above and below with loc, then drop the current row and get the desired counts with value_counts. Some values may be missing, so reindex and then select:
def f(x):
for i,d in x.iterrows():
above = x.loc[:i, 'Gender'].drop(i).value_counts().reindex(['M','F'])
below = x.loc[i:, 'Gender'].drop(i).value_counts().reindex(['M','F'])
x.loc[i,'M_Above'] = above['M']
x.loc[i,'M_Below'] = below['M']
x.loc[i,'F_Above'] = above['F']
x.loc[i,'F_Below'] = below['F']
return x
df1 = df.groupby('Team', sort=False).apply(f).fillna(0)
#cast float to int
df1.loc[:,'M_Above':] = df1.loc[:,'M_Above':].astype(int)
print (df1)
Age Gender Name Team M_Above M_Below F_Above F_Below
0 22 M Sam A 0 2 0 1
1 25 F Annie A 1 2 0 0
6 33 M Greg A 1 1 1 0
4 56 M Alan A 2 0 1 0
3 18 M James B 0 1 0 1
5 28 F Julie B 1 1 0 0
2 33 M Fred B 1 0 1 0
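For completeness (my own addition, separate from both answers above): the above/below counts can also be computed without any row loops, using group-wise cumulative sums after sorting by age. A sketch, starting from the original df:
df = df.sort_values(['Team', 'Age'])
for g in ['M', 'F']:
    flag = df['Gender'].eq(g).astype(int)
    cum = flag.groupby(df['Team']).cumsum()          # same-gender rows seen so far, including this one
    tot = flag.groupby(df['Team']).transform('sum')  # same-gender rows in the whole team
    df[f'{g}_Above'] = cum - flag                    # strictly above in the sorted group
    df[f'{g}_Below'] = tot - cum                     # strictly below in the sorted group
This reproduces the M_Above/M_Below/F_Above/F_Below columns from the answers above and should scale much better on large frames.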