I have the dataframe below.
Type Major GPA
1 A 0
2 B 1
3 C 0
4 A 0
5 B 0
6 C 1
I would like to groupby('Major', sort=False), but order the groups themselves by referencing the 'GPA' column.
The desired dataframe would be like this:
Type Major GPA
2 B 1
5 B 0
6 C 1
3 C 0
1 A 0
4 A 0
How can this be done? Thanks!!
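For reference, a minimal snippet to reconstruct the example frame (assuming 'Type' is an ordinary column, as in the table above):
import pandas as pd

df = pd.DataFrame({'Type': [1, 2, 3, 4, 5, 6],
                   'Major': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'GPA': [0, 1, 0, 0, 0, 1]})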
Let us use transform to create an additional sort key:
out = (df.assign(key=df.groupby('Major')['GPA'].transform('sum'))
         .sort_values(['key', 'Major', 'GPA'], ascending=[False, True, False])
         .drop(columns='key'))  # drop('key', 1) used the positional axis argument, removed in pandas 2.x
Out[37]:
Type Major GPA
1 2 B 1
4 5 B 0
5 6 C 1
2 3 C 0
0 1 A 0
3 4 A 0
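To see what the helper key contains before it is dropped, you can inspect the per-group sums on their own (a quick check, not part of the original answer):
# each row gets the total GPA of its Major group: B -> 1, C -> 1, A -> 0
df.groupby('Major')['GPA'].transform('sum')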
This might work:
order = {'B': 0, 'C': 1, 'A': 2}

def my_order(x):
    # sort_values passes each sort column to key as a whole Series,
    # so map the custom order over 'Major' and leave 'GPA' as-is
    return x.map(order) if x.name == 'Major' else x

df.sort_values(['Major', 'GPA'], ascending=[True, False], key=my_order)
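An ordered pd.Categorical is another way to express the same hard-coded order (a sketch, not from the original answer, assuming B < C < A is the order you want):
df['Major'] = pd.Categorical(df['Major'], categories=['B', 'C', 'A'], ordered=True)
df.sort_values(['Major', 'GPA'], ascending=[True, False])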
Let's say I have a (pandas) dataframe like this:
Index A ID B C
1 a 1 0 0
2 b 2 0 0
3 c 2 a a
4 d 3 0 0
I want to copy the data of the third row to the second row, because their IDs match but the second row's data is not filled. However, I want to leave column 'A' intact. I am looking for a result like this:
Index A ID B C
1 a 1 0 0
2 b 2 a a
3 c 2 a a
4 d 3 0 0
What would you suggest as a solution?
You can try replacing '0' with NaN, then using ffill()+bfill() with groupby()+apply():
df[['B','C']] = df[['B','C']].replace('0', float('NaN'))
df[['B','C']] = df.groupby('ID')[['B','C']].apply(lambda x: x.ffill().bfill()).fillna('0')
output of df:
Index A ID B C
0 1 a 1 0 0
1 2 b 2 a a
2 3 c 2 a a
3 4 d 3 0 0
Note: you can also use the transform() method in place of apply().
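A sketch of that transform variant (same logic, filling within each ID group):
df[['B','C']] = df.groupby('ID')[['B','C']].transform(lambda x: x.ffill().bfill()).fillna('0')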
You can use combine_first:
s = df.loc[df[["B","C"]].ne("0").all(1)].set_index("ID")[["B", "C"]]
print (s.combine_first(df.set_index("ID")).reset_index())
ID A B C Index
0 1 a 0 0 1.0
1 2 b a a 2.0
2 2 c a a 3.0
3 3 d 0 0 4.0
import pandas as pd
data = { 'A': ['a', 'b', 'c', 'd'], 'ID': [1, 2, 2, 3], 'B': [0, 0, 'a', 0], 'C': [0, 0, 'a', 0]}
df = pd.DataFrame(data)
df.index += 1
index_to_be_replaced = 2
index_to_use_to_replace = 3
columns_to_replace = ['ID', 'B', 'C']
columns_not_to_replace = ['A']
x = df[columns_not_to_replace].loc[index_to_be_replaced]
y = df[columns_to_replace].loc[index_to_use_to_replace]
df.loc[index_to_be_replaced] = pd.concat([x, y])
print(df)
Does this solve your problem? I would look into other pandas functions as well, like join and merge.
❯ python3 b.py
A ID B C
1 a 1 0 0
2 b 2 a a
3 c 2 a a
4 d 3 0 0
I have the following dataframe which is a small part of a bigger one:
acc_num trans_cdi
0 1 c
1 1 d
3 3 d
4 3 c
5 3 d
6 3 d
I'd like to delete, within each group, all trailing rows whose value is "d". So my desired dataframe would look like this:
acc_num trans_cdi
0 1 c
3 3 d
4 3 c
So the point is that a group shouldn't end with "d".
I have code that deletes the last row of a group when that row is "d". But I have to run it twice to delete all the trailing "d"s in group 3, for example.
clean_3 = clean_2[clean_2.groupby('acc_num')['trans_cdi'].transform(lambda x: (x.iloc[-1] != "d") | (x.index != x.index[-1]))]
Is there a better solution to this problem?
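For reference, a minimal reconstruction of the sample frame (index labels taken from the table above):
import pandas as pd

df = pd.DataFrame({'acc_num': [1, 1, 3, 3, 3, 3],
                   'trans_cdi': ['c', 'd', 'd', 'c', 'd', 'd']},
                  index=[0, 1, 3, 4, 5, 6])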
We can use idxmax here, reversing the data with [::-1], and then take the index:
grps = df['trans_cdi'].ne('d').groupby(df['acc_num'], group_keys=False)
idx = grps.apply(lambda x: x.loc[:x[::-1].idxmax()]).index
df.loc[idx]
acc_num trans_cdi
0 1 c
3 3 d
4 3 c
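To see why this works: ne('d') marks the non-'d' rows True, and reversing the group before idxmax returns the label of the last True, so loc[:label] keeps everything up to and including the last non-'d' row. A toy illustration:
s = pd.Series([True, False, True, False, False], index=[0, 1, 3, 4, 5])
s[::-1].idxmax()   # -> 3, the label of the LAST True in the original order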
Testing on consecutive values:
acc_num trans_cdi
0 1 c
1 1 d <--- d between two c, so we need to keep
2 1 c
3 1 d <--- row to be dropped
4 3 d
5 3 c
6 3 d
7 3 d
grps = df['trans_cdi'].ne('d').groupby(df['acc_num'], group_keys=False)
idx = grps.apply(lambda x: x.loc[:x[::-1].idxmax()]).index
df.loc[idx]
acc_num trans_cdi
0 1 c
1 1 d
2 1 c
4 3 d
5 3 c
It still gives the correct result.
You can try this not-so-pandorable solution.
def r(x):
    # count the trailing 'd's at the end of the group
    c = 0
    for v in x['trans_cdi'].iloc[::-1]:
        if v == 'd':
            c = c + 1
        else:
            break
    # guard c == 0: x.iloc[:-0] would return an empty frame
    return x if c == 0 else x.iloc[:-c]

df.groupby('acc_num', group_keys=False).apply(r)
acc_num trans_cdi
0 1 c
3 3 d
4 3 c
First, use shift to compare each row with the previous one, and drop a row when both values equal 'd' (the ~ inverts the mask to filter those rows out).
Second, make sure the last row's value is not 'd'. If it is, delete that row.
code:
df = df[~((df['trans_cdi'] == 'd') & (df.shift(1)['trans_cdi'] == 'd'))]
if df['trans_cdi'].iloc[-1] == 'd':
    df = df.iloc[:-1]
df
input (I tested it on more input data to ensure there were no bugs):
acc_num trans_cdi
0 1 c
1 1 d
3 3 d
4 3 c
5 3 d
6 3 d
7 1 d
8 1 d
9 3 c
10 3 c
11 3 d
12 3 d
output:
acc_num trans_cdi
0 1 c
1 1 d
4 3 c
5 3 d
9 3 c
10 3 c
I'm new to Python and could not find the answer I'm looking for anywhere.
I have a DataFrame that has the following structure:
df = pd.DataFrame(index=list('abc'), data={'A1': range(3), 'A2': range(3),'B1': range(3), 'B2': range(3), 'C1': range(3), 'C2': range(3)})
df
Out[1]:
A1 A2 B1 B2 C1 C2
a 0 0 0 0 0 0
b 1 1 1 1 1 1
c 2 2 2 2 2 2
The numbers are periods and the letters are variables. I'd like to transform the columns so that the periods and variables are split into a MultiIndex. The desired output would look like this:
A B C
1 2 1 2 1 2
a 0 0 0 0 0 0
b 1 1 1 1 1 1
c 2 2 2 2 2 2
I've tried the following:
periods = list(range(1, 3))
df.columns = df.columns.str.replace(r'\d+', '', regex=True)
df.columns = pd.MultiIndex.from_product([df.columns, periods])
That seems to multiply the columns and raises a ValueError: Length mismatch; after stripping the digits the columns are A, A, B, B, C, C, so from_product builds 12 combinations for only 6 columns.
In my real dataframe I have 72 periods and 12 variables.
Thanks in advance for your help!
Edit: I realized that I haven't been precise enough. I have column names like Impressions1, Impressions2...Impressions72 and hhi1, hhi2...hhi72. So df.columns.str[0], df.columns.str[1] does not work for me, as the column names have different lengths. I think the solution might involve a regex, but I can't figure out how to write it. Any ideas?
Use pd.MultiIndex.from_tuples:
df.columns = pd.MultiIndex.from_tuples(list(zip(df.columns.str[0],df.columns.str[1])))
print(df)
A B C
1 2 1 2 1 2
a 0 0 0 0 0 0
b 1 1 1 1 1 1
c 2 2 2 2 2 2
Alternative:
pd.MultiIndex.from_tuples([tuple(name) for name in df.columns])
or
pd.MultiIndex.from_tuples(map(tuple, df.columns))
You can also use .str.extract with from_frame:
df.columns = pd.MultiIndex.from_frame(df.columns.str.extract('(.)(.)'), names=[None, None])
Output:
A B C
1 2 1 2 1 2
a 0 0 0 0 0 0
b 1 1 1 1 1 1
c 2 2 2 2 2 2
Here is what actually solved my issue:
df.columns = pd.MultiIndex.from_frame(df.columns.str.extract(r'([a-zA-Z]+)([0-9]+)'), names=[None, None])
Thanks @Scott Boston for the inspiration for this solution!
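A quick check of that pattern on the longer names from the edit (column names assumed from the question):
cols = pd.Index(['Impressions1', 'Impressions2', 'hhi1', 'hhi2'])
cols.str.extract(r'([a-zA-Z]+)([0-9]+)')
#              0  1
# 0  Impressions  1
# 1  Impressions  2
# 2          hhi  1
# 3          hhi  2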
I have a df,
name_id name
1 a
2 b
2 b
3 c
3 c
3 c
Now I want to groupby name_id and assign -1 to the name_id of rows in groups whose length is 1 (i.e. fewer than 2):
one_occurrence_indices = df.groupby('name_id').filter(lambda x: len(x) == 1).index.tolist()
for index in one_occurrence_indices:
    df.loc[index, 'name_id'] = -1
I am wondering what the best way to do this is. The resulting df would be:
name_id name
-1 a
2 b
2 b
3 c
3 c
3 c
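For reference, a minimal reconstruction of the sample frame:
import pandas as pd
import numpy as np

df = pd.DataFrame({'name_id': [1, 2, 2, 3, 3, 3],
                   'name': ['a', 'b', 'b', 'c', 'c', 'c']})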
Use transform with loc:
df.loc[df.groupby('name_id')['name_id'].transform('size') == 1, 'name_id'] = -1
An alternative is numpy.where:
df['name_id'] = np.where(df.groupby('name_id')['name_id'].transform('size') == 1,
-1, df['name_id'])
print (df)
name_id name
0 -1 a
1 2 b
2 2 b
3 3 c
4 3 c
5 3 c
Also, if you want to test for duplicates, use duplicated:
df['name_id'] = np.where(df.duplicated('name_id', keep=False), df['name_id'], -1)
Use:
df.name_id *= (df.groupby('name_id').name.transform(len) == 1).map({True: -1, False: 1})
df
Out[50]:
name_id name
0 -1 a
1 2 b
2 2 b
3 3 c
4 3 c
5 3 c
Using pd.DataFrame.mask:
lens = df.groupby('name_id')['name'].transform(len)
df['name_id'].mask(lens < 2, -1, inplace=True)
print(df)
name_id name
0 -1 a
1 2 b
2 2 b
3 3 c
4 3 c
5 3 c
I'm trying to perform a specific operation on a dataframe.
Given the following dataframe:
df1 = pd.DataFrame({
    'id': [0, 1, 2, 1, 3, 0],
    'letter': ['a', 'b', 'c', 'b', 'b', 'a'],
    'status': [0, 1, 0, 0, 0, 1]})
id letter status
0 a 0
1 b 1
2 c 0
1 b 0
3 b 0
0 a 1
I'd like to create another dataframe that contains rows from df1, subject to the following restriction:
If 2 or more rows have the same id and letter, keep only the row whose status is 1. All other rows must be copied over unchanged.
The resulting dataframe should look like this:
id letter status
0 a 1
1 b 1
2 c 0
3 b 0
Any help is greatly appreciated. Thank you
This should work:
>>> fn = lambda obj: obj[obj.status == 1] if any(obj.status == 1) else obj
>>> df1.groupby(['id', 'letter'], as_index=False).apply(fn)
id letter status
5 0 a 1
1 1 b 1
2 2 c 0
4 3 b 0
[4 rows x 3 columns]
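If you want a clean 0..n index on the result, you can reset it afterwards (a cosmetic step, not part of the original answer):
out = df1.groupby(['id', 'letter'], as_index=False).apply(fn).reset_index(drop=True)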
Sort by 'status' first and then use groupby:
In [1932]: df1.sort_values(by='status').groupby('id', as_index=False).last()
Out[1932]:
id letter status
0 0 a 1
1 1 b 1
2 2 c 0
3 3 b 0
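Note that grouping by 'id' alone works here only because each id maps to a single letter in this data; if the same id could occur with different letters, grouping on both keys would be safer (a sketch):
df1.sort_values(by='status').groupby(['id', 'letter'], as_index=False).last()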