I have a pandas df:
number sample chrom1 start chrom2 end
1 s1 1 0 2 1500
2 s1 2 10 2 50
19 s2 3 3098318 3 3125700
19 s3 3 3098720 3 3125870
20 s4 3 3125694 3 3126976
20 s1 3 3125694 3 3126976
20 s1 3 3125695 3 3126976
20 s5 3 3125700 3 3126976
21 s3 3 3125870 3 3134920
22 s2 3 3126976 3 3135039
24 s5 3 17286051 3 17311472
25 s2 3 17286052 3 17294628
26 s4 3 17286052 3 17311472
26 s1 3 17286052 3 17311472
27 s3 3 17286405 3 17294550
28 s4 3 17293197 3 17294628
28 s1 3 17293197 3 17294628
28 s5 3 17293199 3 17294628
29 s2 3 17294628 3 17311472
I am trying to group lines that have different numbers, but whose start values are within +/- 10 of each other AND whose end values are also within +/- 10, on the same chromosomes.
In this example I want to find these two lines:
24 s5 3 17286051 3 17311472
26 s4 3 17286052 3 17311472
Both have the same chrom1 [3] and chrom2 [3], and their start and end values are within +/- 10 of each other; I want to group them under the same number:
24 s5 3 17286051 3 17311472
24 s4 3 17286052 3 17311472 # Change the number to the first seen in this series
Here's what I'm trying:
import pandas as pd
from collections import defaultdict

def parse_vars(inFile):
    df = pd.read_csv(inFile, delimiter="\t")
    df = df[['number', 'chrom1', 'start', 'chrom2', 'end']]

    vars = {}
    seen_l = defaultdict(lambda: defaultdict(dict))  # To track the `starts`
    seen_r = defaultdict(lambda: defaultdict(dict))  # To track the `ends`

    for index in df.index:
        event = df.loc[index, 'number']
        c1 = df.loc[index, 'chrom1']
        b1 = int(df.loc[index, 'start'])
        c2 = df.loc[index, 'chrom2']
        b2 = int(df.loc[index, 'end'])
        print([event, c1, b1, c2, b2])

        vars[event] = [c1, b1, c2, b2]

        # Iterate over windows +/- 10
        for i, j in zip(range(b1 - 10, b1 + 10), range(b2 - 10, b2 + 10)):
            # if:
            #   i in seen_l[c1] AND
            #   j in seen_r[c2] AND
            #   the 'number' for these two instances is the same:
            if i in seen_l[c1] and j in seen_r[c2] and seen_l[c1][i] == seen_r[c2][j]:
                print(seen_l[c1][i], seen_r[c2][j])
                if seen_l[c1][i] != event:
                    print("Seen: %s %s in event %s %s" % (event, [c1, b1, c2, b2], seen_l[c1][i], vars[seen_l[c1][i]]))

        seen_l[c1][b1] = event
        seen_r[c2][b2] = event
The problem I'm having is that seen_l[3][17286052] exists in both numbers 25 and 26, and as their respective seen_r events (seen_r[3][17294628] = 25, seen_r[3][17311472] = 26) are not equal, I am unable to join these lines together.
Is there a way that I can use a list of start values as the nested key for seen_l dict?
Interval overlaps are easy in pyranges. Most of the code below is to separate out the starts and ends into two different dfs. Then these are joined based on an interval overlap of +/- 10:
from io import StringIO
import pandas as pd
import pyranges as pr
c = """number sample chrom1 start chrom2 end
1 s1 1 0 2 1500
2 s1 2 10 2 50
19 s2 3 3098318 3 3125700
19 s3 3 3098720 3 3125870
20 s4 3 3125694 3 3126976
20 s1 3 3125694 3 3126976
20 s1 3 3125695 3 3126976
20 s5 3 3125700 3 3126976
21 s3 3 3125870 3 3134920
22 s2 3 3126976 3 3135039
24 s5 3 17286051 3 17311472
25 s2 3 17286052 3 17294628
26 s4 3 17286052 3 17311472
26 s1 3 17286052 3 17311472
27 s3 3 17286405 3 17294550
28 s4 3 17293197 3 17294628
28 s1 3 17293197 3 17294628
28 s5 3 17293199 3 17294628
29 s2 3 17294628 3 17311472"""
df = pd.read_csv(StringIO(c), sep=r"\s+")
df1 = df[["chrom1", "start", "number", "sample"]]
df1.insert(2, "end", df.start + 1)
df2 = df[["chrom2", "end", "number", "sample"]]
df2.insert(2, "start", df.end - 1)
names = ["Chromosome", "Start", "End", "number", "sample"]
df1.columns = names
df2.columns = names
gr1, gr2 = pr.PyRanges(df1), pr.PyRanges(df2)
j = gr1.join(gr2, slack=10)
# +--------------+-----------+-----------+-----------+------------+-----------+-----------+------------+------------+
# | Chromosome | Start | End | number | sample | Start_b | End_b | number_b | sample_b |
# | (category) | (int32) | (int32) | (int64) | (object) | (int32) | (int32) | (int64) | (object) |
# |--------------+-----------+-----------+-----------+------------+-----------+-----------+------------+------------|
# | 3 | 3125694 | 3125695 | 20 | s4 | 3125700 | 3125699 | 19 | s2 |
# | 3 | 3125694 | 3125695 | 20 | s1 | 3125700 | 3125699 | 19 | s2 |
# | 3 | 3125695 | 3125696 | 20 | s1 | 3125700 | 3125699 | 19 | s2 |
# | 3 | 3125700 | 3125701 | 20 | s5 | 3125700 | 3125699 | 19 | s2 |
# | ... | ... | ... | ... | ... | ... | ... | ... | ... |
# | 3 | 17294628 | 17294629 | 29 | s2 | 17294628 | 17294627 | 25 | s2 |
# | 3 | 17294628 | 17294629 | 29 | s2 | 17294628 | 17294627 | 28 | s5 |
# | 3 | 17294628 | 17294629 | 29 | s2 | 17294628 | 17294627 | 28 | s1 |
# | 3 | 17294628 | 17294629 | 29 | s2 | 17294628 | 17294627 | 28 | s4 |
# +--------------+-----------+-----------+-----------+------------+-----------+-----------+------------+------------+
# Unstranded PyRanges object has 13 rows and 9 columns from 1 chromosomes.
# For printing, the PyRanges was sorted on Chromosome.
# to get the data as a pandas df:
jdf = j.df
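If you also want to relabel matched rows with the first-seen number, as asked in the question, here is a minimal follow-up sketch on top of jdf. It assumes lower numbers were seen first, and chains of matches (a~b, b~c) would need a connected-components pass rather than a plain dict, so treat it as a sketch rather than a drop-in answer:
# keep matches between rows that currently carry different numbers
pairs = jdf[jdf.number != jdf.number_b]
# map each higher number to the lowest number it was matched with
remap = {}
for a, b in zip(pairs.number, pairs.number_b):
    lo, hi = sorted((int(a), int(b)))
    remap[hi] = min(lo, remap.get(hi, lo))
df["number"] = df["number"].replace(remap)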
Problem:
I have a DataFrame like so:
import pandas as pd
df = pd.DataFrame({
"name":["john","jim","eric","jim","john","jim","jim","eric","eric","john"],
"category":["a","b","c","b","a","b","c","c","a","c"],
"amount":[100,200,13,23,40,2,43,92,83,1]
})
name | category | amount
----------------------------
0 john | a | 100
1 jim | b | 200
2 eric | c | 13
3 jim | b | 23
4 john | a | 40
5 jim | b | 2
6 jim | c | 43
7 eric | c | 92
8 eric | a | 83
9 john | c | 1
I would like to add two new columns: first; the total amount for the relevant category for the name of the row (eg: the value in row 0 would be 140, because john has a total of 100 + 40 of the a category). Second; the counts of those name and category combinations which are being summed in the first new column (eg: the row 0 value would be 2).
Desired output:
The output I'm looking for here looks like this:
name | category | amount | sum_for_category | count_for_category
------------------------------------------------------------------------
0 john | a | 100 | 140 | 2
1 jim | b | 200 | 225 | 3
2 eric | c | 13 | 105 | 2
3 jim | b | 23 | 225 | 3
4 john | a | 40 | 140 | 2
5 jim | b | 2 | 225 | 3
6 jim | c | 43 | 43 | 1
7 eric | c | 92 | 105 | 2
8 eric | a | 83 | 83 | 1
9 john | c | 1 | 1 | 1
I don't want to group the data by the features because I want to keep the same number of rows. I just want to tag on the desired value for each row.
Best I could do:
I can't find a good way to do this. The best I've been able to come up with is the following:
names = df["name"].unique()
categories = df["category"].unique()
sum_for_category = {i:{
j:df.loc[(df["name"]==i)&(df["category"]==j)]["amount"].sum() for j in categories
} for i in names}
df["sum_for_category"] = df.apply(lambda x: sum_for_category[x["name"]][x["category"]],axis=1)
count_for_category = {i:{
j:df.loc[(df["name"]==i)&(df["category"]==j)]["amount"].count() for j in categories
} for i in names}
df["count_for_category"] = df.apply(lambda x: count_for_category[x["name"]][x["category"]],axis=1)
But this is extremely clunky and slow; far too slow to be viable on my actual dataset (roughly 700,000 rows x 10 columns). I'm sure there's a better and faster way to do this... Many thanks in advance.
You need two groupby.transform:
g = df.groupby(['name', 'category'])['amount']
df['sum_for_category'] = g.transform('sum')
df['count_for_category'] = g.transform('size')
Output:
   name category  amount  sum_for_category  count_for_category
0 john a 100 140 2
1 jim b 200 225 3
2 eric c 13 105 2
3 jim b 23 225 3
4 john a 40 140 2
5 jim b 2 225 3
6 jim c 43 43 1
7 eric c 92 105 2
8 eric a 83 83 1
9 john c 1 1 1
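Note that 'size' counts every row in the group, while 'count' counts only non-missing values in the selected column; since amount has no NaNs here, either transform gives the same result:
# equivalent here because 'amount' contains no missing values
df['count_for_category'] = g.transform('count')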
Another possible solution:
g = df.groupby(['name', 'category']).amount.agg(['sum','count']).reset_index()
df.merge(g, on = ['name', 'category'], how = 'left')
Output:
name category amount sum count
0 john a 100 140 2
1 jim b 200 225 3
2 eric c 13 105 2
3 jim b 23 225 3
4 john a 40 140 2
5 jim b 2 225 3
6 jim c 43 43 1
7 eric c 92 105 2
8 eric a 83 83 1
9 john c 1 1 1
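If you want the exact column names from the desired output, a small follow-up on the merge could rename them:
out = (df.merge(g, on=['name', 'category'], how='left')
         .rename(columns={'sum': 'sum_for_category', 'count': 'count_for_category'}))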
import pandas as pd
df = pd.DataFrame({
"name":["john","jim","eric","jim","john","jim","jim","eric","eric","john"],
"category":["a","b","c","b","a","b","c","c","a","c"],
"amount":[100,200,13,23,40,2,43,92,83,1]
})
df_Count = df.groupby(['name','category']).count().reset_index().rename({'amount':'Count_For_Category'}, axis=1)
df_Sum = df.groupby(['name','category']).sum().reset_index().rename({'amount':'Sum_For_Category'},axis=1)
df_v2 = pd.merge(df,df_Count[['name','category','Count_For_Category']], left_on=['name','category'], right_on=['name','category'], how='left')
df_v2 = pd.merge(df_v2,df_Sum[['name','category','Sum_For_Category']], left_on=['name','category'], right_on=['name','category'], how='left')
df_v2
This is a simple approach that is easy to follow: two groupbys to get the per-group count and sum, then two left merges back onto the original frame. Just run the code above and you will get the desired result.
I have a somewhat involved transformation of my data where I was wondering if someone had a more efficient method than mine. I start with a dataframe as this one:
| | item | value |
|---:|:------------------|--------:|
| 0 | WAUZZZF23MN053792 | 0 |
| 1 | A | 1 |
| 2 | WF0TK3SS2MMA50940 | 0 |
| 3 | A | 10 |
| 4 | B | 11 |
| 5 | C | 12 |
| 6 | D | 13 |
| 7 | E | 14 |
| 8 | W0VEAZKXZMJ857138 | 0 |
| 9 | A | 20 |
| 10 | B | 21 |
| 11 | C | 22 |
| 12 | D | 23 |
| 13 | E | 24 |
| 14 | W0VEAZKXZMJ837930 | 0 |
| 15 | A | 30 |
| 16 | B | 31 |
| 17 | C | 32 |
| 18 | D | 33 |
| 19 | E | 34 |
and I would like to arrive here:
| | item | value | C |
|---:|:------------------|--------:|----:|
| 0 | WAUZZZF23MN053792 | 0 | nan |
| 1 | WF0TK3SS2MMA50940 | 0 | 12 |
| 2 | W0VEAZKXZMJ857138 | 0 | 22 |
| 3 | W0VEAZKXZMJ837930 | 0 | 32 |
i.e. for every "long" entry, check if there is an item "C" following, and if so, copy that line's value to the line with the long item.
The ugly way I have done this is the following:
import re
import pandas as pd
df = pd.DataFrame(
    {
        "item": [
            "WAUZZZF23MN053792", "A",
            "WF0TK3SS2MMA50940", "A", "B", "C", "D", "E",
            "W0VEAZKXZMJ857138", "A", "B", "C", "D", "E",
            "W0VEAZKXZMJ837930", "A", "B", "C", "D", "E",
        ],
        "value": [
            0, 1,
            0, 10, 11, 12, 13, 14,
            0, 20, 21, 22, 23, 24,
            0, 30, 31, 32, 33, 33,
        ],
    }
)
def isVIN(x):
    return (len(x) == 17) & (x.upper() == x) & (re.search(r"\s|O|I", x) is None)

# filter the lines with item=="C" or a VIN in item
x = pd.concat([df, df["item"].rename("group").apply(isVIN).cumsum()], axis=1).loc[
    lambda x: (x["item"] == "C") | (x["item"].apply(isVIN))
]

# pivot the lines where item=="C"
y = x.loc[x["item"] == "C"].pivot(columns="item").droplevel(level=1, axis=1)

# and then merge the two:
print(
    x.loc[x["item"].apply(isVIN)]
    .merge(y, on="group", how="left")
    .drop("group", axis=1)
    .rename(columns={"value_y": "C", "value_x": "value"})
    .to_markdown()
)
Does anyone have an idea how to make this a bit less ugly?
Subjectively less ugly
mask = df.item.str.len().eq(17)
df.set_index(
    [df.item.where(mask).ffill(), 'item']
)[~mask.to_numpy()].value.unstack()['C'].reset_index()
item C
0 W0VEAZKXZMJ837930 32.0
1 W0VEAZKXZMJ857138 22.0
2 WAUZZZF23MN053792 NaN
3 WF0TK3SS2MMA50940 12.0
A bit more involved but better
import numpy as np

mask = df.item.str.len().eq(17)
item = df.item.where(mask).pad()
subs = df.item.mask(mask)
valu = df.value
i, r = pd.factorize(item)
j, c = pd.factorize(subs)
a = np.zeros((len(r), len(c)), valu.dtype)
a[i, j] = valu
pd.DataFrame(a, r, c)[['C']].rename_axis('item').reset_index()
item C
0 WAUZZZF23MN053792 0
1 WF0TK3SS2MMA50940 12
2 W0VEAZKXZMJ857138 22
3 W0VEAZKXZMJ837930 32
Try:
# Your conditions vectorized
m = ((df['item'].str.len() == 17)
     & (df['item'].str.upper() == df['item'])
     & (~df['item'].str.contains(r'\s|O|I')))
# Create virtual groups to align rows
df['grp'] = m.cumsum()
# Merge and align rows
out = (pd.concat([df[m].set_index('grp'),
                  df[~m].pivot(index='grp', columns='item', values='value')], axis=1)
         .reset_index(drop=True))
Output:
>>> out
item value A B C D E
0 WAUZZZF23MN053792 0 1.0 NaN NaN NaN NaN
1 WF0TK3SS2MMA50940 0 10.0 11.0 12.0 13.0 14.0
2 W0VEAZKXZMJ857138 0 20.0 21.0 22.0 23.0 24.0
3 W0VEAZKXZMJ837930 0 30.0 31.0 32.0 33.0 33.0
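If you only need the C column, as in the desired output, you can then select it from out:
out[['item', 'value', 'C']]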
The other answers are all very nice. For a bit more variety, you could also filter df for "long" data and C values; concat; then "compress" the DataFrame using groupby + first:
out = pd.concat([df[df['item'].str.len()==17],
                 df.loc[df['item']=='C', ['value']].set_axis(['C'], axis=1)], axis=1)
out = out.groupby(out['item'].str.len().eq(17).cumsum()).first().reset_index(drop=True)
Output:
item value C
0 WAUZZZF23MN053792 0.0 NaN
1 WF0TK3SS2MMA50940 0.0 12.0
2 W0VEAZKXZMJ857138 0.0 22.0
3 W0VEAZKXZMJ837930 0.0 32.0
How about this with datar, a pandas wrapper that reimagines pandas APIs:
Construct data
>>> import re
>>> from datar.all import (
... c, f, LETTERS, tibble, first, cumsum,
... mutate, group_by, slice, first, pivot_wider, select
... )
>>>
>>> df = tibble(
... item=c(
... "WAUZZZF23MN053792",
... "A",
... "WF0TK3SS2MMA50940",
... LETTERS[:5],
... "W0VEAZKXZMJ857138",
... LETTERS[:5],
... "W0VEAZKXZMJ837930",
... LETTERS[:5],
... ),
... value=c(
... 0, 1,
... 0, f[10:15],
... 0, f[20:25],
... 0, f[30:35],
... )
... )
>>> df
item value
<object> <int64>
0 WAUZZZF23MN053792 0
1 A 1
2 WF0TK3SS2MMA50940 0
3 A 10
4 B 11
5 C 12
6 D 13
7 E 14
8 W0VEAZKXZMJ857138 0
9 A 20
10 B 21
11 C 22
12 D 23
13 E 24
14 W0VEAZKXZMJ837930 0
15 A 30
16 B 31
17 C 32
18 D 33
19 E 34
Manipulate data
>>> def isVIN(x):
... return len(x) == 17 and x.isupper() and re.search(r"\s|O|I", x) is None
...
>>> (
... df
... # Mark the VIN groups
... >> mutate(is_vin=cumsum(f.item.transform(isVIN)))
... # Group by VINs
... >> group_by(f.is_vin)
... # Put the VINs and their values in new columns
... >> mutate(vin=first(f.item), vin_value=first(f.value))
... # Exclude VINs in the items
... >> slice(~c(0))
... # Get the values of A, B, C ...
... >> pivot_wider([f.vin, f.vin_value], names_from=f.item, values_from=f.value)
... # Select and rename columns
... >> select(item=f.vin, value=f.vin_value, C=f.C)
... )
item value C
<object> <int64> <float64>
0 W0VEAZKXZMJ837930 0 32.0
1 W0VEAZKXZMJ857138 0 22.0
2 WAUZZZF23MN053792 0 NaN
3 WF0TK3SS2MMA50940 0 12.0
I am having a hard time figuring out how to get "rolling weights" based off of one of my columns, then factor these weights onto another column.
I've tried groupby.rolling.apply (function) on my data but the main problem is just conceptualizing how I'm going to take a running/rolling average of the column I'm going to turn into weights, and then factor this "window" of weights onto another column that isn't rolled.
I'm also purposely setting min_periods to 1, so you'll notice that the first two rows of each group in my desired output ("rwavg") mirror the original b values.
w is the rolling column to derive the weights from.
b is the column to apply the rolled weights to.
Grouping is done only on column a.
df is already sorted by a and yr.
def wavg(w, x):
    return (x * w).sum() / w.sum()

n = df.groupby(['a'])[['w']].rolling(window=3, min_periods=1).apply(lambda x: wavg(df['w'], df['b']))
Input:
id | yr | a | b | w
---------------------------------
0 | 1990 | a1 | 50 | 3000
1 | 1991 | a1 | 40 | 2000
2 | 1992 | a1 | 10 | 1000
3 | 1993 | a1 | 20 | 8000
4 | 1990 | b1 | 10 | 500
5 | 1991 | b1 | 20 | 1000
6 | 1992 | b1 | 30 | 500
7 | 1993 | b1 | 40 | 4000
Desired output:
id | yr | a | b | rwavg
---------------------------------
0 1990 a1 50 50
1 1991 a1 40 40
2 1992 a1 10 39.96
3 1993 a1 20 22.72
4 1990 b1 10 10
5 1991 b1 20 20
6 1992 b1 30 20
7 1993 b1 40 35.45
apply with rolling usually has some weird behavior; instead, compute the weighted sum and the weight sum separately and divide:
df['Weight']=df.b*df.w
g=df.groupby(['a']).rolling(window=3,min_periods=1)
g['Weight'].sum()/g['w'].sum()
df['rwavg']=(g['Weight'].sum()/g['w'].sum()).values
Out[277]:
a
a1 0 50.000000
1 46.000000
2 40.000000
3 22.727273
b1 4 10.000000
5 16.666667
6 20.000000
7 35.454545
dtype: float64
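The idea is that the rolling weighted average is sum(b*w) / sum(w) over each window; since rolling.apply only sees one column at a time, the product b*w is precomputed as its own column. A restated sketch of the same steps (column names taken from the question, and relying on df already being sorted by a and yr):
# precompute the numerator terms b * w as a column
df['bw'] = df['b'] * df['w']
# rolling sums within each 'a' group; their ratio is the weighted average
roll = df.groupby('a').rolling(window=3, min_periods=1)
df['rwavg'] = (roll['bw'].sum() / roll['w'].sum()).to_numpy()
df = df.drop(columns=['bw'])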
Sorry in advance for the number of images, but they help demonstrate the issue.
I have built a dataframe which contains film thickness measurements, for a number of substrates, for a number of layers, as function of coordinates:
| | Sub | Result | Layer | Row | Col |
|----|-----|--------|-------|-----|-----|
| 0 | 1 | 2.95 | 3 - H | 0 | 72 |
| 1 | 1 | 2.97 | 3 - V | 0 | 72 |
| 2 | 1 | 0.96 | 1 - H | 0 | 72 |
| 3 | 1 | 3.03 | 3 - H | -42 | 48 |
| 4 | 1 | 3.04 | 3 - V | -42 | 48 |
| 5 | 1 | 1.06 | 1 - H | -42 | 48 |
| 6 | 1 | 3.06 | 3 - H | 42 | 48 |
| 7 | 1 | 3.09 | 3 - V | 42 | 48 |
| 8 | 1 | 1.38 | 1 - H | 42 | 48 |
| 9 | 1 | 3.05 | 3 - H | -21 | 24 |
| 10 | 1 | 3.08 | 3 - V | -21 | 24 |
| 11 | 1 | 1.07 | 1 - H | -21 | 24 |
| 12 | 1 | 3.06 | 3 - H | 21 | 24 |
| 13 | 1 | 3.09 | 3 - V | 21 | 24 |
| 14 | 1 | 1.05 | 1 - H | 21 | 24 |
| 15 | 1 | 3.01 | 3 - H | -63 | 0 |
| 16 | 1 | 3.02 | 3 - V | -63 | 0 |
and this continues for >10 subs (per batch), and 13 sites per sub, and for 3 layers - this df is a composite.
I am attempting to present the data as a facetgrid of heatmaps (adapting code from How to make heatmap square in Seaborn FacetGrid - thanks!)
I can plot a subset of the df quite happily:
spam = df.loc[(df.Sub == 6) & (df.Layer == '3 - H')]
spam_p= spam.pivot(index='Row', columns='Col', values='Result')
sns.heatmap(spam_p, cmap="plasma")
BUT - there are some missing results, where the layer measurement errored out (returning '10000'), so I've replaced these with NaNs:
df['Result'] = df['Result'].replace(10000, np.nan)
To plot a facetgrid to show all subs/layers, I've written the following code:
def draw_heatmap(*args, **kwargs):
    data = kwargs.pop('data')
    d = data.pivot(columns=args[0], index=args[1],
                   values=args[2])
    sns.heatmap(d, **kwargs)

fig = sns.FacetGrid(spam, row='Sub',
                    col='Layer', height=5, aspect=1)
fig.map_dataframe(draw_heatmap, 'Col', 'Row', 'Result', cbar=False, cmap="plasma", annot=True, annot_kws={"size": 20})
which yields:
It has automatically adjusted axes to not show any positions where there is a NaN.
I have tried masking (see https://github.com/mwaskom/seaborn/issues/375) but it just errors out with Inconsistent shape between the condition and the input (got (237, 15) and (7, 7)).
And the result of this is that, when not using the cropped-down dataset (i.e. df instead of spam), the code generates the following FacetGrid:
Plots featuring missing values at extreme (edge) coordinate positions make the plot shift within the axes - here all apparently to the upper left. Sub #5, layer 3-H should look like:
i.e. blanks in the places where there are NaNs.
Why is the facetgrid shifting the entire plot up and/or left? The alternative is dynamically generating subplots based on a sub/layer-count (ugh!).
Any help very gratefully received.
Full dataset for 2 layers of sub 5:
Sub Result Layer Row Col
0 5 2.987 3 - H 0 72
1 5 0.001 1 - H 0 72
2 5 1.184 3 - H -42 48
3 5 1.023 1 - H -42 48
4 5 3.045 3 - H 42 48
5 5 0.282 1 - H 42 48
6 5 3.083 3 - H -21 24
7 5 0.34 1 - H -21 24
8 5 3.07 3 - H 21 24
9 5 0.41 1 - H 21 24
10 5 NaN 3 - H -63 0
11 5 NaN 1 - H -63 0
12 5 3.086 3 - H 0 0
13 5 0.309 1 - H 0 0
14 5 0.179 3 - H 63 0
15 5 0.455 1 - H 63 0
16 5 3.067 3 - H -21 -24
17 5 0.136 1 - H -21 -24
18 5 1.907 3 - H 21 -24
19 5 1.018 1 - H 21 -24
20 5 NaN 3 - H -42 -48
21 5 NaN 1 - H -42 -48
22 5 NaN 3 - H 42 -48
23 5 NaN 1 - H 42 -48
24 5 NaN 3 - H 0 -72
25 5 NaN 1 - H 0 -72
You may create a list of unique column and row labels and reindex the pivot table with them.
cols = df["Col"].unique()
rows = df["Row"].unique()
pivot = data.pivot(...).reindex(columns=cols, index=rows)
as seen in this answer.
Some complete code:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
r = np.repeat([0,-2,2,-1,1,-3],2)
row = np.concatenate((r, [0]*2, -r[::-1]))
c = np.array([72]*2+[48]*4 + [24]*4 + [0]* 3)
col = np.concatenate((c,-c[::-1]))
df = pd.DataFrame({"Result" : np.random.rand(26),
"Layer" : list("AB")*13,
"Row" : row, "Col" : col})
df1 = df.copy()
df1["Sub"] = [5]*len(df1)
df1.at[10:11,"Result"] = np.NaN
df1.at[20:,"Result"] = np.NaN
df2 = df.copy()
df2["Sub"] = [3]*len(df2)
df2.at[0:2,"Result"] = np.NaN
df = pd.concat([df1,df2])
cols = np.unique(df["Col"].values)
rows = np.unique(df["Row"].values)
def draw_heatmap(*args, **kwargs):
    data = kwargs.pop('data')
    d = data.pivot(columns=args[0], index=args[1],
                   values=args[2])
    d = d.reindex(columns=cols, index=rows)
    print(d)
    sns.heatmap(d, **kwargs)

grid = sns.FacetGrid(df, row='Sub', col='Layer', height=3.5, aspect=1)
grid.map_dataframe(draw_heatmap, 'Col', 'Row', 'Result', cbar=False,
                   cmap="plasma", annot=True)
plt.show()
I use the pandas package for Python, and I have a question.
I have DataFrame like this:
| first | last | datr |city|
|Zahir |Petersen|22.11.15|9 |
|Zahir |Petersen|22.11.15|2 |
|Mason |Sellers |10.04.16|4 |
|Gannon |Cline |29.10.15|2 |
|Craig |Sampson |20.04.16|2 |
|Craig |Sampson |20.04.16|4 |
|Cameron |Mathis |09.05.15|6 |
|Adam |Hurley |16.04.16|2 |
|Brock |Vaughan |14.04.16|10 |
|Xanthus |Murray |30.03.15|6 |
|Xanthus |Murray |30.03.15|7 |
|Xanthus |Murray |30.03.15|4 |
|Palmer |Caldwell|31.10.15|2 |
I want to create a pivot_table on the fields ['first', 'last', 'datr'], but display
['first', 'last', 'datr', 'city'] only where the record count for ['first', 'last', 'datr'] is more than one, like this:
| first | last | datr |city|
|Zahir |Petersen|22.11.15|9 | 2
| | | |2 | 2
|Craig |Sampson |20.04.16|2 | 2
| | | |4 | 2
|Xanthus |Murray |30.03.15|6 | 3
| | | |7 | 3
| | | |4 | 3
UPD.
If I group by three of the four fields, then
df['count'] = df.groupby(['first','last','datr']).transform('count')
works, but if more than one column is left outside the groupby, this code throws an error. For example, with all 4 columns ('first', 'last', 'datr', 'city') and 2 groupby columns ('first', 'last'), 4 - 2 = 2 columns remain:
In [181]: df['count'] = df.groupby(['first','last']).transform('count')
...
ValueError: Wrong number of items passed 2, placement implies 1
You can do this with groupby. Group by the three columns (first, last and datr), and then count the number of elements in each group:
In [63]: df['count'] = df.groupby(['first', 'last', 'datr']).transform('count')
In [64]: df
Out[64]:
first last datr city count
0 Zahir Petersen 22.11.15 9 2
1 Zahir Petersen 22.11.15 2 2
2 Mason Sellers 10.04.16 4 1
3 Gannon Cline 29.10.15 2 1
4 Craig Sampson 20.04.16 2 2
5 Craig Sampson 20.04.16 4 2
6 Cameron Mathis 09.05.15 6 1
7 Adam Hurley 16.04.16 2 1
8 Brock Vaughan 14.04.16 10 1
9 Xanthus Murray 30.03.15 6 3
10 Xanthus Murray 30.03.15 7 3
11 Xanthus Murray 30.03.15 4 3
12 Palmer Caldwell 31.10.15 2 1
From there, you can filter the frame:
In [65]: df[df['count'] > 1]
Out[65]:
first last datr city count
0 Zahir Petersen 22.11.15 9 2
1 Zahir Petersen 22.11.15 2 2
4 Craig Sampson 20.04.16 2 2
5 Craig Sampson 20.04.16 4 2
9 Xanthus Murray 30.03.15 6 3
10 Xanthus Murray 30.03.15 7 3
11 Xanthus Murray 30.03.15 4 3
And if you want these columns as the index (as in the example output in your question): df.set_index(['first', 'last', 'datr'])
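As for the error in the UPD: transform('count') returns one column per non-grouped column, so assigning the result to a single new column fails when more than one column is left over. A workaround is to select a single column before transforming (any column without missing values gives the row count):
# count rows per ('first', 'last') group by transforming a single column
df['count'] = df.groupby(['first', 'last'])['city'].transform('count')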