Aggregate values with corresponding counts in pandas - python

I have a pandas dataframe something like:
my_df =
chr PI
2 5
2 5
2 5
2 6
2 6
2 8
2 8
2 8
2 8
2 8
3 5
3 5
3 5
3 5
3 9
3 9
3 9
3 9
3 9
3 9
3 9
3 7
3 7
3 4
......
......
I want to convert it into a new dataframe that contains summary information about the original, something like:
chr: unique chromosomes
unq_PI : number of unique PIs within each chromosome
PIs : list of unique "PI" values in that chromosome
PI_freq: count of each "PI" value in the respective chromosome
So, expected output would be:
chr unq_PI PIs PI_freq
2 3 5,6,8 3,2,5
3 4 5,9,7,4 4,7,2,1
I was thinking something like:
new_df = pd.DataFrame({'chr': my_df['chr'].unique(),
'unq_PI': my_df('chr')['unq_PI'].nunique()),
'PIs': .......................,
'PI_freq': ..................})
The only code that works is for `chr` when used alone; any additional code just throws an error. How can I fix this?
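For reference, a minimal sketch reconstructing the sample frame shown above (only the rows listed; the answers below refer to it simply as df):

import pandas as pd

df = pd.DataFrame({
    'chr': [2]*10 + [3]*14,
    'PI': [5, 5, 5, 6, 6, 8, 8, 8, 8, 8,
           5, 5, 5, 5, 9, 9, 9, 9, 9, 9, 9, 7, 7, 4]
})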

Use groupby + value_counts, followed by groupby + agg.
v = (df.groupby('chr')
       .PI
       .apply(pd.Series.value_counts, sort=False)
       .reset_index(level=1)
       .astype(str)
       .groupby(level=0)
       .agg(','.join)
       .rename(columns={'level_1' : 'PIs', 'PI' : 'PI_freq'})
    )
This doesn't account for the count of unique values; that can be computed using groupby + nunique and inserted as the first column:
v.insert(0, 'unq_PI', df.groupby('chr').PI.nunique())
v
unq_PI PIs PI_freq
chr
2 3 5,6,8 3,2,5
3 4 4,5,7,9 1,4,2,7

You can use value_counts:
s = df.groupby('chr').PI  # the PI column grouped by chromosome (this definition is omitted in the original snippet)
yourdf = pd.concat([s.nunique(),
                    s.value_counts().to_frame('n').reset_index().groupby('chr').agg(lambda x: ','.join(x.astype(str)))],
                   axis=1)
yourdf
Out[90]:
PI PI n
chr
2 3 8,5,6 5,3,2
3 4 9,5,7,4 7,4,2,1
yourdf.columns=['unq_PI','PIs','PI_freq']
yourdf
Out[93]:
unq_PI PIs PI_freq
chr
2 3 8,5,6 5,3,2
3 4 9,5,7,4 7,4,2,1

If order is important, use a custom function:
def f(x):
    a = x.value_counts().astype(str).reindex(x.unique())
    i = ['unq_PI','PIs','PI_freq']
    return pd.Series([x.nunique(), ','.join(a.index), ','.join(a)], index=i)

df = df['PI'].astype(str).groupby(df['chr'], sort=False).apply(f).unstack().reset_index()
Another solution:
df = (df.rename(columns={'PI' : 'PIs'})
        .groupby(['chr','PIs'], sort=False)
        .size()
        .rename('PI_freq')
        .reset_index(level=1)
        .astype(str)
        .groupby(level=0)
        .agg(','.join)
        .assign(unq_PI=lambda x: x['PIs'].str.count(',') + 1)
        .reset_index()
        .reindex(columns=['chr','unq_PI','PIs','PI_freq'])
     )
print (df)
chr unq_PI PIs PI_freq
0 2 3 5,6,8 3,2,5
1 3 4 5,9,7,4 4,7,2,1
Explanation:
You can groupby both columns and get the size, which gives the unique values of PI and their frequency per group. Then reset_index the second level of the MultiIndex into a column and cast to string:
df1 = (df.rename(columns={'PI' : 'PIs'})
         .groupby(['chr','PIs'], sort=False)
         .size()
         .rename('PI_freq')
         .reset_index(level=1)
         .astype(str)
      )
print (df1)
PIs PI_freq
chr
2 5 3
2 6 2
2 8 5
3 5 4
3 9 7
3 7 2
3 4 1
Then groupby the index by level=0 and aggregate with join:
df1 = (df.rename(columns={'PI' : 'PIs'})
         .groupby(['chr','PIs'], sort=False)
         .size()
         .rename('PI_freq')
         .reset_index(level=1)
         .astype(str)
         .groupby(level=0)
         .agg(','.join)
      )
print (df1)
PIs PI_freq
chr
2 5,6,8 3,2,5
3 5,9,7,4 4,7,2,1
Last, get the number of unique values by counting the separators with assign for the new column, and reindex for a custom order of the final columns:
df1 = (df.rename(columns={'PI' : 'PIs'})
         .groupby(['chr','PIs'], sort=False)
         .size()
         .rename('PI_freq')
         .reset_index(level=1)
         .astype(str)
         .groupby(level=0)
         .agg(','.join)
         .assign(unq_PI=lambda x: x['PIs'].str.count(',') + 1)
         .reset_index()
         .reindex(columns=['chr','unq_PI','PIs','PI_freq'])
      )
print (df1)
chr unq_PI PIs PI_freq
0 2 3 5,6,8 3,2,5
1 3 4 5,9,7,4 4,7,2,1

Related

How to apply .astype() method to a dataframe in Python?

I want to convert multiple columns in a dataframe (pandas) to the type "category" using the method .astype. Here is my code:
df['Field_1'].astype('category').cat.codes
works; however,
categories = df.select_types('objects')
categories['Field_1'].cat.codes
doesn't.
Would someone please tell me why?
In general, the question is how to apply a method (.astype) to a dataframe. I know how to apply a method to a column in a dataframe; however, applying it to a dataframe hasn't been successful, even with a for loop, since the for loop returns a series and the method .cat.codes is not applicable to the series.
I think you need to process each column separately with DataFrame.apply and a lambda function; your code failed because Series.cat.codes is not implemented for DataFrame:
df = pd.DataFrame({
    'A':list('acbdac'),
    'B':[4,5,4,5,5,4],
    'C':[7,8,9,4,2,3],
    'D':list('dddbbb')
})
cols = df.select_dtypes('object').columns
df[cols] = df[cols].apply(lambda x: x.astype('category').cat.codes)
print (df)
A B C D
0 0 4 7 1
1 2 5 8 1
2 1 4 9 1
3 3 5 4 0
4 0 5 2 0
5 2 4 3 0
A similar idea (not sure if the output is always the same) is to convert all columns to categorical in the first step with DataFrame.astype:
cols = df.select_dtypes('object').columns
df[cols] = df[cols].astype('category').apply(lambda x: x.cat.codes)
print (df)
A B C D
0 0 4 7 1
1 2 5 8 1
2 1 4 9 1
3 3 5 4 0
4 0 5 2 0
5 2 4 3 0

How to create pairs of column names based on a condition?

I have the following DataFrame df:
df =
min(arc) max(arc) min(gbm)_p1 max(gbm)_p1
1 10 2 5
0 11 1 6
How can I calculate the difference between pairs of max and min columns?
Expected result:
diff(arc) diff(gbm)_p1
9 3
11 5
I assume that apply(lambda x: ...) should be used to calculate the differences row-wise, but how can I create pairs of columns? In my case, I should only calculate the difference between columns that have the same name, e.g. ...(arc) or ...(gbm)_p1. Please notice that min and max prefixes always appear at the beginning of the column names.
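For reference, a minimal sketch building the sample frame above (without the id column from the edit further down):

import pandas as pd

df = pd.DataFrame({'min(arc)': [1, 0],
                   'max(arc)': [10, 11],
                   'min(gbm)_p1': [2, 1],
                   'max(gbm)_p1': [5, 6]})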
The idea is to filter the min and max columns with DataFrame.filter using a regex (where ^ matches the start of the string), then rename the columns so both frames share the same column names and can be subtracted:
df1 = df.filter(regex='^min').rename(columns= lambda x: x.replace('min','diff'))
df2 = df.filter(regex='^max').rename(columns= lambda x: x.replace('max','diff'))
df = df2.sub(df1)
print (df)
diff(arc) diff(gbm)_p1
0 9 3
1 11 5
EDIT:
print (df)
id min(arc) max(arc) min(gbm)_p1 max(gbm)_p1
0 123 1 10 2 5
1 546 0 11 1 6
df1 = df.filter(regex='^min').rename(columns= lambda x: x.replace('min','diff'))
df2 = df.filter(regex='^max').rename(columns= lambda x: x.replace('max','diff'))
df = df[['id']].join(df2.sub(df1))
print (df)
id diff(arc) diff(gbm)_p1
0 123 9 3
1 546 11 5

create pandas pivottable with a long multiindex

I have a dataframe df with the shape (4573,64) that I'm trying to pivot. The last column is an 'id' with two possible string values 'old' and 'new'. I would like to set the first 63 columns as index and then have the 'id' column across the top with values being the count of 'old' or 'new' for each index row.
I've created a list object named cols out of the column labels that I want as the index.
I tried the following:
df.pivot(index=cols, columns='id')['id']
this gives an error: 'all arrays must be same length'
I also tried the following to see if I can get a sum, but no luck either:
pd.pivot_table(df,index=cols,values=['id'],aggfunc=np.sum)
Any ideas greatly appreciated.
I found a thread online talking about a possible bug in pandas 0.23.0 where the pandas.pivot_table() will not accept the multiindex as long as it contains NaN's (link to github in comments). My workaround was to do
df.fillna('empty', inplace=True)
then the solution below:
df1 = pd.pivot_table(df, index=cols,columns='id',aggfunc='size', fill_value=0)
as proposed by jezrael works as intended, hence the accepted answer.
I believe you need to convert the column names to a list and then aggregate size with unstack:
df = pd.DataFrame({'B':[4,4,4,5,5,4],
                   'C':[1,1,9,4,2,3],
                   'D':[1,1,5,7,1,0],
                   'E':[0,0,6,9,2,4],
                   'id':list('aaabbb')})
print (df)
B C D E id
0 4 1 1 0 a
1 4 1 1 0 a
2 4 9 5 6 a
3 5 4 7 9 b
4 5 2 1 2 b
5 4 3 0 4 b
cols = df.columns[:-1].tolist()   # every column except 'id'
df1 = df.groupby(cols + ['id']).size().unstack(fill_value=0)
print (df1)
id a b
B C D E
4 1 1 0 2 0
3 0 4 0 1
9 5 6 1 0
5 2 1 2 0 1
4 7 9 0 1
Solution with pivot_table:
df1 = pd.pivot_table(df, index=cols,columns='id',aggfunc='size', fill_value=0)
print (df1)
id a b
B C D E
4 1 1 0 2 0
3 0 4 0 1
9 5 6 1 0
5 2 1 2 0 1
4 7 9 0 1

Calculating CAGR by row by row in a pandas data frame?

I am working with company data. I have a data set of roughly 1900 companies (index) and 30 variables per company (columns). These variables always come in groups of three (three periods). It basically looks like this:
df = pd.DataFrame({'id' : ['1','2','3','7'],
                   'revenue_0' : [7,2,5,4],
                   'revenue_1' : [5,6,3,1],
                   'revenue_2' : [1,9,4,8],
                   'profit_0' : [3,6,4,4],
                   'profit_1' : [4,6,9,1],
                   'profit_2' : [5,5,9,8]})
I am trying to compute the compound annual growth rate (CAGR) for e.g. revenue for each company (id), such that revenue_cagr = ((revenue_2/revenue_0)^(1/3))-1
I would like to pass a function to a set of columns row by row - at least, that is my idea.
def CAGR(start_value, end_value, periods):
    return ((end_value / start_value) ** (1.0 / periods)) - 1
Is it possible to apply this function row by row for a set of columns (maybe with for i, row in df.iterrows(): or df.apply())? Or is there a smarter way to do this?
Update
The desired outcome - examplified with the column revenue_cagr - should look as follows:
df = pd.DataFrame({'id' : ['1','2','3','7'],
                   'revenue_0' : [7,2,5,4],
                   'revenue_1' : [5,6,3,1],
                   'revenue_2' : [1,9,4,8],
                   'profit_0' : [3,6,4,4],
                   'profit_1' : [4,6,9,1],
                   'profit_2' : [5,5,9,8],
                   'revenue_cagr' : [-0.48, 0.65, -0.07, 0.26],
                   'profit_cagr' : [0.19, -0.06, 0.31, 0.26]})
You can use set_index + str.rsplit to split the column names into metric/period pairs first:
df1 = df.set_index('id')
df1.columns = df1.columns.str.rsplit('_', expand=True, n=1)
print (df1)
profit revenue
0 1 2 0 1 2
id
1 3 4 5 7 5 1
2 6 6 5 2 6 9
3 4 9 9 5 3 4
7 4 1 8 4 1 8
Then divide the period-2 columns by the period-0 columns (both selected with xs) using div, apply pow and sub, and add a suffix with add_suffix:
df1 = (df1.xs('2', axis=1, level=1)
          .div(df1.xs('0', axis=1, level=1))
          .pow(1./3)
          .sub(1)
          .add_suffix('_cagr'))
print (df1)
profit_cagr revenue_cagr
id
1 0.185631 -0.477242
2 -0.058964 0.650964
3 0.310371 -0.071682
7 0.259921 0.259921
Last, join back to the original:
df = df.join(df1, on='id')
print (df)
id profit_0 profit_1 profit_2 revenue_0 revenue_1 revenue_2 \
0 1 3 4 5 7 5 1
1 2 6 6 5 2 6 9
2 3 4 9 9 5 3 4
3 7 4 1 8 4 1 8
profit_cagr revenue_cagr
0 0.185631 -0.477242
1 -0.058964 0.650964
2 0.310371 -0.071682
3 0.259921 0.259921
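Alternatively, because pandas arithmetic is vectorized, the corrected CAGR function from the question can be applied to whole columns at once, with no row iteration. A minimal sketch, assuming periods=3 as in the expected output and starting from the original df:

def CAGR(start_value, end_value, periods):
    # element-wise on Series: (end / start) ** (1/periods) - 1
    return ((end_value / start_value) ** (1.0 / periods)) - 1

df['revenue_cagr'] = CAGR(df['revenue_0'], df['revenue_2'], 3)
df['profit_cagr'] = CAGR(df['profit_0'], df['profit_2'], 3)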

Pandas better way for Sorting, Grouping, Summing

New to Pandas so wondering if there is a more Pandithic (coining it!) way to sort some data, group it, and then sum part of it. The problem is to find the 3 largest values in a series of values and then sum only them.
census_cp is a dataframe with information about counties of states. My current solution is:
cen_sort = census_cp.groupby('STNAME').head(3)
cen_sort = cen_sort.groupby('STNAME').sum().sort_values(by='CENSUS2010POP', ascending=False).head(n=3)
cen_sort = cen_sort.reset_index()
print(cen_sort['STNAME'].values.tolist())
I'm specifically curious if there is a better way to do this, as well as why I can't put the sum at the end of the previous line and chain together what seem to me to be obviously connected operations (get the top 3 of each and add them together).
I think you can use head with sum within each group first, and then nlargest:
df = census_cp.groupby('STNAME') \
              .apply(lambda x: x.head(3).sum(numeric_only=True)) \
              .reset_index() \
              .nlargest(3, 'CENSUS2010POP')
Sample:
census_cp = pd.DataFrame({'STNAME':list('abscscbcdbcsscae'),
                          'CENSUS2010POP':[4,5,6,5,6,2,3,4,5,6,4,5,4,3,6,5]})
print (census_cp)
CENSUS2010POP STNAME
0 4 a
1 5 b
2 6 s
3 5 c
4 6 s
5 2 c
6 3 b
7 4 c
8 5 d
9 6 b
10 4 c
11 5 s
12 4 s
13 3 c
14 6 a
15 5 e
df = census_cp.groupby('STNAME') \
.apply(lambda x: x.head(3).sum(numeric_only=True)) \
.reset_index() \
.nlargest(3, 'CENSUS2010POP')
print (df)
STNAME CENSUS2010POP
5 s 17
1 b 14
2 c 11
If you instead need the top 3 values (nlargest) per group and then the nlargest of the summed values, use:
df1 = census_cp.groupby('STNAME')['CENSUS2010POP'] \
               .apply(lambda x: x.nlargest(3).sum()) \
               .nlargest(3) \
               .reset_index()
print (df1)
STNAME CENSUS2010POP
0 s 17
1 b 14
2 c 13
Or:
df1 = census_cp.groupby('STNAME')['CENSUS2010POP'].nlargest(3) \
               .groupby(level=0) \
               .sum() \
               .nlargest(3) \
               .reset_index()
print (df1)
STNAME CENSUS2010POP
0 s 17
1 b 14
2 c 13
