Pandas better way for Sorting, Grouping, Summing - python

New to Pandas, so I'm wondering if there is a more Pandithic (coining it!) way to sort some data, group it, and then sum part of it. The problem is to find the 3 largest values in a series of values and then sum only them.
census_cp is a dataframe with information about counties of states. My current solution is:
cen_sort = census_cp.groupby('STNAME').head(3)
cen_sort = cen_sort.groupby('STNAME').sum().sort_values(by='CENSUS2010POP', ascending=False).head(n=3)
cen_sort = cen_sort.reset_index()
print(cen_sort['STNAME'].values.tolist())
I'm specifically curious whether there is a better way to do this, as well as why I can't put the sum at the end of the previous line and chain together what seem to me to be obviously connected operations (get the top 3 of each and add them together).

I think you can use head with sum first per group via groupby, and then nlargest:
df = (census_cp.groupby('STNAME')
               .apply(lambda x: x.head(3).sum(numeric_only=True))
               .reset_index()
               .nlargest(3, 'CENSUS2010POP'))
Sample:
import pandas as pd

census_cp = pd.DataFrame({'STNAME': list('abscscbcdbcsscae'),
                          'CENSUS2010POP': [4, 5, 6, 5, 6, 2, 3, 4, 5, 6, 4, 5, 4, 3, 6, 5]})
print (census_cp)
    CENSUS2010POP STNAME
0               4      a
1               5      b
2               6      s
3               5      c
4               6      s
5               2      c
6               3      b
7               4      c
8               5      d
9               6      b
10              4      c
11              5      s
12              4      s
13              3      c
14              6      a
15              5      e
df = census_cp.groupby('STNAME') \
.apply(lambda x: x.head(3).sum(numeric_only=True)) \
.reset_index() \
.nlargest(3, 'CENSUS2010POP')
print (df)
  STNAME  CENSUS2010POP
5      s             17
1      b             14
2      c             11
Note that head(3) takes the first 3 rows per group, which are the 3 largest only if the data is already sorted. If you need a double nlargest, i.e. the true top 3 per group and then the 3 largest of the summed values (compare group c: 11 above vs 13 below), use:
df1 = (census_cp.groupby('STNAME')['CENSUS2010POP']
                .apply(lambda x: x.nlargest(3).sum())
                .nlargest(3)
                .reset_index())
print (df1)
  STNAME  CENSUS2010POP
0      s             17
1      b             14
2      c             13
Or:
df1 = (census_cp.groupby('STNAME')['CENSUS2010POP'].nlargest(3)
                .groupby(level=0)
                .sum()
                .nlargest(3)
                .reset_index())
print (df1)
  STNAME  CENSUS2010POP
0      s             17
1      b             14
2      c             13
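As to why the sum can't simply be chained onto the end of the previous line: groupby('STNAME').head(3) returns a plain DataFrame, not a grouped object, so a chained .sum() would collapse the whole frame rather than sum per state. A minimal sketch of the distinction, using the sample census_cp above:
# head(3) returns an ordinary DataFrame, so this sums across all states at once
total = census_cp.groupby('STNAME').head(3).sum(numeric_only=True)
# to sum per state, you have to group again after taking the head
per_state = (census_cp.groupby('STNAME').head(3)
                      .groupby('STNAME').sum())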

Related

How to apply .astype() method to a dataframe in Python?

I want to convert multiple columns in a dataframe (pandas) to the type "category" using the method .astype. Here is my code:
df['Field_1'].astype('category').cat.codes
works however
categories = df.select_dtypes('object')
categories['Field_1'].cat.codes
doesn't.
Would someone please tell me why?
In general, the question is how to apply a method (.astype) to a dataframe. I know how to apply a method to a column in a dataframe; however, applying it to a dataframe hasn't been successful, even with a for loop, since the for loop returns a series and the method .cat.codes is not applicable to a series.
I think you need to process each column separately with DataFrame.apply and a lambda function; your code failed because Series.cat.codes is not implemented for DataFrame:
import pandas as pd

df = pd.DataFrame({
    'A': list('acbdac'),
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': list('dddbbb')
})

cols = df.select_dtypes('object').columns
df[cols] = df[cols].apply(lambda x: x.astype('category').cat.codes)
print (df)
   A  B  C  D
0  0  4  7  1
1  2  5  8  1
2  1  4  9  1
3  3  5  4  0
4  0  5  2  0
5  2  4  3  0
A similar idea (I am not sure the output is always the same) is to convert all columns to categorical in the first step with DataFrame.astype:
cols = df.select_dtypes('object').columns
df[cols] = df[cols].astype('category').apply(lambda x: x.cat.codes)
print (df)
   A  B  C  D
0  0  4  7  1
1  2  5  8  1
2  1  4  9  1
3  3  5  4  0
4  0  5  2  0
5  2  4  3  0
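If you prefer to avoid apply entirely, a dict-comprehension sketch (my variation, starting again from the original df; not part of the answer above) builds the codes column by column:
# encode each object column independently; each column gets its own categorical codes
cols = df.select_dtypes('object').columns
df[cols] = pd.DataFrame({c: df[c].astype('category').cat.codes for c in cols})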

Sort Dataframe by Descending Rows AND Columns at the Same Time

I currently have a dataframe of countries by series, with values ranging from 0 to 25.
I want to sort the df so that the highest values appear in the top left (first), while the lowest appear in the bottom right (last).
FROM
      A  B   C   D  ...
USA   4  0  10  16
CHN   2  3  13  22
UK    2  1   8  14
...
TO
      D   C  A  B  ...
CHN  22  13  2  3
USA  16  10  4  0
UK   14   8  2  1
...
In this, the column with the highest values is now first, and the same is true with the index.
I have considered reindexing, but this loses the 'Countries' Index.
    D   C  A  B  ...
0  22  13  2  3
1  16  10  4  0
2  14   8  2  1
...
I have thought about creating a new column and row that holds the mean or sum of values for the respective column/row, but is this the most efficient way?
How would I then sort the DF once I have the new rows/columns?
Is there a way to reindex using...
df_mv.reindex(df_mv.mean(or sum)().sort_values(ascending = False).index, axis=1)
... that would allow me to keep the country index, and simply sort it accordingly?
Thanks for any and all advice or assistance.
EDIT
Intended result organizes columns AND rows from largest to smallest.
Regarding the first row of the A and B columns in the intended output, these are supposed to be 2, 3 respectively. This is because the intended result interprets the A column as greater than the B column in both sum and mean (even though either sum or mean can be considered for the 'value' of a row/column).
By saying the higher numbers would be in the top left, while the lower ones would be in the bottom right, I simply meant this as a general trend for the resulting df. It is the columns and rows as whole however, that are the intended focus. I apologize for the confusion.
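For reference in the answers below, the sample frame can be reconstructed like this (a sketch assuming the values shown above):
import pandas as pd

df = pd.DataFrame({'A': [4, 2, 2], 'B': [0, 3, 1],
                   'C': [10, 13, 8], 'D': [16, 22, 14]},
                  index=['USA', 'CHN', 'UK'])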
You could use:
rows_index=df.max(axis=1).sort_values(ascending=False).index
col_index=df.max().sort_values(ascending=False).index
new_df=df.loc[rows_index,col_index]
print(new_df)
      D   C  A  B
CHN  22  13  2  3
USA  16  10  4  0
UK   14   8  2  1
Use .T to transpose rows to columns and vice versa:
df = df.sort_values(df.max().idxmax(), ascending=False)
df = df.T
df = df.sort_values(df.columns[0], ascending=False).T
Result:
>>> df
      D   C  B  A
CHN  22  13  3  2
USA  16  10  0  4
UK   14   8  1  2
Here's another way, this time without transposing but using axis=1 as an argument:
df = df.sort_values(df.max().idxmax(), ascending=False)
df = df.sort_values(df.index[0], axis=1, ascending=False)
Using numpy:
import numpy as np

arr = df.to_numpy()
order = np.max(arr, axis=1).argsort()[::-1]   # row order by row maximum, descending
arr = arr[order, :]
arr = np.sort(arr, axis=1)[:, ::-1]           # sort each row descending
# each row is sorted independently, so the original column labels no longer
# correspond to the values; only the row labels can be carried over
df1 = pd.DataFrame(arr, index=df.index[order], columns=df.columns)
print(df1)
Output:
      A   B  C  D
CHN  22  13  3  2
USA  16  10  4  0
UK   14   8  2  1

How to create pairs of column names based on a condition?

I have the following DataFrame df:
df =
min(arc)  max(arc)  min(gbm)_p1  max(gbm)_p1
       1        10            2            5
       0        11            1            6
How can I calculate the difference between pairs of max and min columns?
Expected result:
diff(arc)  diff(gbm)_p1
        9             3
       11             5
I assume that apply(lambda x: ...) should be used to calculate the differences row-wise, but how can I create pairs of columns? In my case, I should only calculate the difference between columns that have the same name, e.g. ...(arc) or ...(gbm)_p1. Please notice that min and max prefixes always appear at the beginning of the column names.
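For a runnable example, the sample df can be built as follows (a sketch based on the table above):
import pandas as pd

df = pd.DataFrame({'min(arc)': [1, 0], 'max(arc)': [10, 11],
                   'min(gbm)_p1': [2, 1], 'max(gbm)_p1': [5, 6]})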
The idea is to filter both sets of columns with DataFrame.filter using a regex, where ^ matches the start of the string, and rename the columns so that both DataFrames share the same column names, which makes subtraction possible:
df1 = df.filter(regex='^min').rename(columns=lambda x: x.replace('min', 'diff'))
df2 = df.filter(regex='^max').rename(columns=lambda x: x.replace('max', 'diff'))
df = df2.sub(df1)
print (df)
   diff(arc)  diff(gbm)_p1
0          9             3
1         11             5
EDIT:
print (df)
    id  min(arc)  max(arc)  min(gbm)_p1  max(gbm)_p1
0  123         1        10            2            5
1  546         0        11            1            6
df1 = df.filter(regex='^min').rename(columns=lambda x: x.replace('min', 'diff'))
df2 = df.filter(regex='^max').rename(columns=lambda x: x.replace('max', 'diff'))
df = df[['id']].join(df2.sub(df1))
print (df)
    id  diff(arc)  diff(gbm)_p1
0  123          9             3
1  546         11             5

Aggregate values with corresponding counts in pandas

I have pandas dataframe something like:
my_df =
chr PI
2 5
2 5
2 5
2 6
2 6
2 8
2 8
2 8
2 8
2 8
3 5
3 5
3 5
3 5
3 9
3 9
3 9
3 9
3 9
3 9
3 9
3 7
3 7
3 4
......
......
I want to convert it into a new dataframe that contains summary information about the dataframe, something like:
chr: unique chromosomes
unq_PI: number of unique PIs within each chromosome
PIs: list of "PI" values in that chromosome
PI_freq: frequency of each "PI" in the respective chromosome
So, expected output would be:
chr unq_PI PIs PI_freq
2 3 5,6,8 3,2,5
3 4 5,9,7,4 4,7,2,1
I was thinking something like:
new_df = pd.DataFrame({'chr': my_df['chr'].unique(),
'unq_PI': my_df('chr')['unq_PI'].nunique()),
'PIs': .......................,
'PI_freq': ..................})
The only code that works is for `chr` when used alone; any additional code just throws an error. How can I fix this?
Use groupby + value_counts, followed by groupby + agg.
v = (df.groupby('chr')
       .PI
       .apply(pd.Series.value_counts, sort=False)
       .reset_index(level=1)
       .astype(str)
       .groupby(level=0)
       .agg(','.join)
       .rename(columns={'level_1': 'PIs', 'PI': 'PI_freq'})
    )
This doesn't account for the count of unique values; that can be computed using groupby + nunique:
v.insert(0, 'unq_PI', df.groupby('chr').PI.nunique())
v
     unq_PI      PIs  PI_freq
chr
2         3    5,6,8    3,2,5
3         4  4,5,7,9  1,4,2,7
You can use value_counts; here s is the grouped column (its definition was implicit in the original answer):
s = df.groupby('chr')['PI']
yourdf = pd.concat([s.nunique(),
                    s.value_counts().to_frame('n').reset_index()
                     .groupby('chr').agg(lambda x: ','.join(x.astype(str)))],
                   axis=1)
yourdf
Out[90]:
     PI       PI        n
chr
2     3    8,5,6    5,3,2
3     4  9,5,7,4  7,4,2,1
yourdf.columns=['unq_PI','PIs','PI_freq']
yourdf
Out[93]:
     unq_PI      PIs  PI_freq
chr
2         3    8,5,6    5,3,2
3         4  9,5,7,4  7,4,2,1
If order is important, use a custom function:
def f(x):
    a = x.value_counts().astype(str).reindex(x.unique())
    i = ['unq_PI', 'PIs', 'PI_freq']
    return pd.Series([x.nunique(), ','.join(a.index), ','.join(a)], index=i)

df = df['PI'].astype(str).groupby(df['chr'], sort=False).apply(f).unstack().reset_index()
Another solution:
df = (df.rename(columns={'PI': 'PIs'})
        .groupby(['chr', 'PIs'], sort=False)
        .size()
        .rename('PI_freq')
        .reset_index(level=1)
        .astype(str)
        .groupby(level=0)
        .agg(','.join)
        .assign(unq_PI=lambda x: x['PIs'].str.count(',') + 1)
        .reset_index()
        .reindex(columns=['chr', 'unq_PI', 'PIs', 'PI_freq'])
     )
print (df)
   chr  unq_PI      PIs  PI_freq
0    2       3    5,6,8    3,2,5
1    3       4  5,9,7,4  4,7,2,1
Explanation:
You can group by both columns and get the size, which yields each unique PI value and its frequency per group. Then reset_index converts the second level of the MultiIndex to a column, and the values are cast to string:
df1 = (df.rename(columns={'PI': 'PIs'})
         .groupby(['chr', 'PIs'], sort=False)
         .size()
         .rename('PI_freq')
         .reset_index(level=1)
         .astype(str)
      )
print (df1)
    PIs PI_freq
chr
2     5       3
2     6       2
2     8       5
3     5       4
3     9       7
3     7       2
3     4       1
Then group by the index with level=0 and aggregate with join:
df1 = (df.rename(columns={'PI': 'PIs'})
         .groupby(['chr', 'PIs'], sort=False)
         .size()
         .rename('PI_freq')
         .reset_index(level=1)
         .astype(str)
         .groupby(level=0)
         .agg(','.join)
      )
print (df1)
         PIs  PI_freq
chr
2      5,6,8    3,2,5
3    5,9,7,4  4,7,2,1
Last, get the number of unique values by counting the separators (count of ',' plus 1) with assign for the new column, and reindex for a custom order of the final columns:
df1 = (df.rename(columns={'PI': 'PIs'})
         .groupby(['chr', 'PIs'], sort=False)
         .size()
         .rename('PI_freq')
         .reset_index(level=1)
         .astype(str)
         .groupby(level=0)
         .agg(','.join)
         .assign(unq_PI=lambda x: x['PIs'].str.count(',') + 1)
         .reset_index()
         .reindex(columns=['chr', 'unq_PI', 'PIs', 'PI_freq'])
      )
print (df1)
   chr  unq_PI      PIs  PI_freq
0    2       3    5,6,8    3,2,5
1    3       4  5,9,7,4  4,7,2,1
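For completeness, a sketch that builds the same result column by column from the grouped Series (my variation on the approaches above, preserving order of appearance; not from the original answers):
g = my_df.groupby('chr', sort=False)['PI']
out = pd.DataFrame({
    'unq_PI': g.nunique(),
    'PIs': g.apply(lambda s: ','.join(s.unique().astype(str))),
    # reindex by order of first appearance so the frequencies line up with PIs
    'PI_freq': g.apply(lambda s: ','.join(s.value_counts().reindex(s.unique()).astype(str))),
}).reset_index()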

Dataframe selecting Max for a column but output values of another

I have a dataframe with values similar to below
A10d  B10d  C10d  A  B  C  Strategy
  20    10     5  3  5  1         3
The Strategy column selects the max of A10d, B10d, C10d and returns the corresponding value of A, B, C.
In this case A10d is the largest, so Strategy returns the value of A, which is 3.
I am not sure how to create this Strategy column properly; can anyone advise, please? Thank you very much for your help.
I think you need iloc to select the first columns by position, then get the column names of the row-wise maxima with idxmax and remove the '10d' suffix so the names match the value columns. Last, create the new column with lookup:
print (df)
   A10d  B10d  C10d  A  B  C
0    20    10     5  3  5  1
1    20   100     5  3  5  1
df1 = df.iloc[:, :3]
print (df1)
   A10d  B10d  C10d
0    20    10     5
1    20   100     5
s = df1.idxmax(axis=1).str.replace('10d', '')
print (s)
0    A
1    B
dtype: object
df['Strategy'] = df.lookup(df.index, s)
print (df)
   A10d  B10d  C10d  A  B  C  Strategy
0    20    10     5  3  5  1         3
1    20   100     5  3  5  1         5
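Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0; an equivalent with plain numpy indexing (a sketch reusing the df and s from above, in place of the lookup line) would be:
import numpy as np

# position of each row's chosen column label, then fancy-index the values
col_pos = df.columns.get_indexer(s)
df['Strategy'] = df.to_numpy()[np.arange(len(df)), col_pos]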
