I have the dataframe below.
Type Major GPA
1 A 0
2 B 1
3 C 0
4 A 0
5 B 0
6 C 1
I would like to groupby('Major', sort=False), but order the groups themselves by referencing the 'GPA' column.
The desired dataframe would be like this:
Type Major GPA
2 B 1
5 B 0
6 C 1
3 C 0
1 A 0
4 A 0
How can this be done? Thanks!!
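For reference, a minimal snippet to reconstruct the example frame (assuming 'Type' is an ordinary column, as in the table above):
import pandas as pd

df = pd.DataFrame({'Type': [1, 2, 3, 4, 5, 6],
                   'Major': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'GPA': [0, 1, 0, 0, 0, 1]})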
Let us use transform to create an additional sort key:
out = (df.assign(key=df.groupby('Major')['GPA'].transform('sum'))
         .sort_values(['key', 'Major', 'GPA'], ascending=[False, True, False])
         .drop(columns='key'))  # drop('key', 1) used the positional axis argument, removed in pandas 2.x
Out[37]:
Type Major GPA
1 2 B 1
4 5 B 0
5 6 C 1
2 3 C 0
0 1 A 0
3 4 A 0
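To see what the helper key contains before it is dropped, you can inspect the per-group sums on their own (a quick check, not part of the original answer):
# each row gets the total GPA of its Major group: B -> 1, C -> 1, A -> 0
df.groupby('Major')['GPA'].transform('sum')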
This might work:
order = {'B': 0, 'C': 1, 'A': 2}

def my_order(x):
    # sort_values passes each sort column to key as a whole Series,
    # so map the custom order over 'Major' and leave 'GPA' as-is
    return x.map(order) if x.name == 'Major' else x

df.sort_values(['Major', 'GPA'], ascending=[True, False], key=my_order)
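An ordered pd.Categorical is another way to express the same hard-coded order (a sketch, not from the original answer, assuming B < C < A is the order you want):
df['Major'] = pd.Categorical(df['Major'], categories=['B', 'C', 'A'], ordered=True)
df.sort_values(['Major', 'GPA'], ascending=[True, False])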
Let's say I have a (pandas) dataframe like this:
Index A ID B C
1 a 1 0 0
2 b 2 0 0
3 c 2 a a
4 d 3 0 0
I want to copy the data of the third row to the second row, because their IDs match but the second row's data is not filled. However, I want to leave column 'A' intact. I am looking for a result like this:
Index A ID B C
1 a 1 0 0
2 b 2 a a
3 c 2 a a
4 d 3 0 0
What would you suggest as a solution?
You can try replacing '0' with NaN, then using ffill()+bfill() with groupby()+apply():
df[['B','C']] = df[['B','C']].replace('0', float('NaN'))
df[['B','C']] = df.groupby('ID')[['B','C']].apply(lambda x: x.ffill().bfill()).fillna('0')
output of df:
Index A ID B C
0 1 a 1 0 0
1 2 b 2 a a
2 3 c 2 a a
3 4 d 3 0 0
Note: you can also use the transform() method in place of apply().
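A sketch of that transform variant (same logic, filling within each ID group):
df[['B','C']] = df.groupby('ID')[['B','C']].transform(lambda x: x.ffill().bfill()).fillna('0')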
You can use combine_first:
s = df.loc[df[["B","C"]].ne("0").all(1)].set_index("ID")[["B", "C"]]
print (s.combine_first(df.set_index("ID")).reset_index())
ID A B C Index
0 1 a 0 0 1.0
1 2 b a a 2.0
2 2 c a a 3.0
3 3 d 0 0 4.0
import pandas as pd
data = { 'A': ['a', 'b', 'c', 'd'], 'ID': [1, 2, 2, 3], 'B': [0, 0, 'a', 0], 'C': [0, 0, 'a', 0]}
df = pd.DataFrame(data)
df.index += 1
index_to_be_replaced = 2
index_to_use_to_replace = 3
columns_to_replace = ['ID', 'B', 'C']
columns_not_to_replace = ['A']
x = df[columns_not_to_replace].loc[index_to_be_replaced]
y = df[columns_to_replace].loc[index_to_use_to_replace]
df.loc[index_to_be_replaced] = pd.concat([x, y])
print(df)
Does this solve your problem? I would look into other pandas functions as well, like join and merge.
❯ python3 b.py
A ID B C
1 a 1 0 0
2 b 2 a a
3 c 2 a a
4 d 3 0 0
I have the following dataframe which is a small part of a bigger one:
acc_num trans_cdi
0 1 c
1 1 d
3 3 d
4 3 c
5 3 d
6 3 d
I'd like to delete, within each group, all trailing rows whose value is "d". So my desired dataframe would look like this:
acc_num trans_cdi
0 1 c
3 3 d
4 3 c
So the point is that a group shouldn't end with "d".
I have code that deletes the last row of a group when that row is "d". But I have to run it twice to delete all the trailing "d"s in group 3, for example.
clean_3 = clean_2[clean_2.groupby('acc_num')['trans_cdi'].transform(lambda x: (x.iloc[-1] != "d") | (x.index != x.index[-1]))]
Is there a better solution to this problem?
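For reference, a minimal reconstruction of the sample frame (index labels taken from the table above):
import pandas as pd

df = pd.DataFrame({'acc_num': [1, 1, 3, 3, 3, 3],
                   'trans_cdi': ['c', 'd', 'd', 'c', 'd', 'd']},
                  index=[0, 1, 3, 4, 5, 6])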
We can use idxmax here, reversing the data with [::-1], and then take the index:
grps = df['trans_cdi'].ne('d').groupby(df['acc_num'], group_keys=False)
idx = grps.apply(lambda x: x.loc[:x[::-1].idxmax()]).index
df.loc[idx]
acc_num trans_cdi
0 1 c
3 3 d
4 3 c
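To see why this works: ne('d') marks the non-'d' rows True, and reversing the group before idxmax returns the label of the last True, so loc[:label] keeps everything up to and including the last non-'d' row. A toy illustration:
s = pd.Series([True, False, True, False, False], index=[0, 1, 3, 4, 5])
s[::-1].idxmax()   # -> 3, the label of the LAST True in the original order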
Testing on consecutive values:
acc_num trans_cdi
0 1 c
1 1 d <--- d between two c, so we need to keep
2 1 c
3 1 d <--- row to be dropped
4 3 d
5 3 c
6 3 d
7 3 d
grps = df['trans_cdi'].ne('d').groupby(df['acc_num'], group_keys=False)
idx = grps.apply(lambda x: x.loc[:x[::-1].idxmax()]).index
df.loc[idx]
acc_num trans_cdi
0 1 c
1 1 d
2 1 c
4 3 d
5 3 c
It still gives the correct result.
You can try this not-so-pandorable solution.
def r(x):
    # count the trailing 'd's at the end of the group
    c = 0
    for v in x['trans_cdi'].iloc[::-1]:
        if v == 'd':
            c = c + 1
        else:
            break
    # guard c == 0: x.iloc[:-0] would return an empty frame
    return x if c == 0 else x.iloc[:-c]

df.groupby('acc_num', group_keys=False).apply(r)
acc_num trans_cdi
0 1 c
3 3 d
4 3 c
First, use shift to compare each row with the previous one, and drop a row when both values equal 'd' (the ~ inverts the mask to filter those rows out).
Second, make sure the last row's value is not 'd'. If it is, delete that row.
code:
df = df[~((df['trans_cdi'] == 'd') & (df.shift(1)['trans_cdi'] == 'd'))]
if df['trans_cdi'].iloc[-1] == 'd':
    df = df.iloc[:-1]
df
input (I tested it on more input data to ensure there were no bugs):
acc_num trans_cdi
0 1 c
1 1 d
3 3 d
4 3 c
5 3 d
6 3 d
7 1 d
8 1 d
9 3 c
10 3 c
11 3 d
12 3 d
output:
acc_num trans_cdi
0 1 c
1 1 d
4 3 c
5 3 d
9 3 c
10 3 c
I'm new to Python and could not find the answer I'm looking for anywhere.
I have a DataFrame that has the following structure:
df = pd.DataFrame(index=list('abc'), data={'A1': range(3), 'A2': range(3),'B1': range(3), 'B2': range(3), 'C1': range(3), 'C2': range(3)})
df
Out[1]:
A1 A2 B1 B2 C1 C2
a 0 0 0 0 0 0
b 1 1 1 1 1 1
c 2 2 2 2 2 2
The numbers are periods and the letters are variables. I'd like to transform the columns so that the periods and variables are split into a MultiIndex. The desired output would look like this:
A B C
1 2 1 2 1 2
a 0 0 0 0 0 0
b 1 1 1 1 1 1
c 2 2 2 2 2 2
I've tried the following:
periods = list(range(1, 3))
df.columns = df.columns.str.replace(r'\d+', '', regex=True)
df.columns = pd.MultiIndex.from_product([df.columns, periods])
That seems to multiply the columns and raises a ValueError: Length mismatch; after stripping the digits the columns are A, A, B, B, C, C, so from_product builds 12 combinations for only 6 columns.
In my real dataframe I have 72 periods and 12 variables.
Thanks in advance for your help!
Edit: I realized that I haven't been precise enough. I have column names like Impressions1, Impressions2...Impressions72 and hhi1, hhi2...hhi72. So df.columns.str[0], df.columns.str[1] does not work for me, as the column names have different lengths. I think the solution might involve a regex, but I can't figure out how to write it. Any ideas?
Use pd.MultiIndex.from_tuples:
df.columns = pd.MultiIndex.from_tuples(list(zip(df.columns.str[0],df.columns.str[1])))
print(df)
A B C
1 2 1 2 1 2
a 0 0 0 0 0 0
b 1 1 1 1 1 1
c 2 2 2 2 2 2
Alternative:
pd.MultiIndex.from_tuples([tuple(name) for name in df.columns])
or
pd.MultiIndex.from_tuples(map(tuple, df.columns))
You can also use .str.extract with from_frame:
df.columns = pd.MultiIndex.from_frame(df.columns.str.extract('(.)(.)'), names=[None, None])
Output:
A B C
1 2 1 2 1 2
a 0 0 0 0 0 0
b 1 1 1 1 1 1
c 2 2 2 2 2 2
Here is what actually solved my issue:
df.columns = pd.MultiIndex.from_frame(df.columns.str.extract(r'([a-zA-Z]+)([0-9]+)'), names=[None, None])
Thanks @Scott Boston for the inspiration for this solution!
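A quick check of that pattern on the longer names from the edit (column names assumed from the question):
cols = pd.Index(['Impressions1', 'Impressions2', 'hhi1', 'hhi2'])
cols.str.extract(r'([a-zA-Z]+)([0-9]+)')
#              0  1
# 0  Impressions  1
# 1  Impressions  2
# 2          hhi  1
# 3          hhi  2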
I have a df,
name_id name
1 a
2 b
2 b
3 c
3 c
3 c
Now I want to groupby name_id and assign -1 to the name_id of rows in groups whose length is 1 (i.e. fewer than 2):
one_occurrence_indices = df.groupby('name_id').filter(lambda x: len(x) == 1).index.tolist()
for index in one_occurrence_indices:
    df.loc[index, 'name_id'] = -1
I am wondering what the best way to do this is. The resulting df would be:
name_id name
-1 a
2 b
2 b
3 c
3 c
3 c
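For reference, a minimal reconstruction of the sample frame:
import pandas as pd
import numpy as np

df = pd.DataFrame({'name_id': [1, 2, 2, 3, 3, 3],
                   'name': ['a', 'b', 'b', 'c', 'c', 'c']})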
Use transform with loc:
df.loc[df.groupby('name_id')['name_id'].transform('size') == 1, 'name_id'] = -1
An alternative is numpy.where:
df['name_id'] = np.where(df.groupby('name_id')['name_id'].transform('size') == 1,
-1, df['name_id'])
print (df)
name_id name
0 -1 a
1 2 b
2 2 b
3 3 c
4 3 c
5 3 c
Also, if you want to test for duplicates, use duplicated:
df['name_id'] = np.where(df.duplicated('name_id', keep=False), df['name_id'], -1)
Use:
df.name_id *= (df.groupby('name_id').name.transform(len) == 1).map({True: -1, False: 1})
df
Out[50]:
name_id name
0 -1 a
1 2 b
2 2 b
3 3 c
4 3 c
5 3 c
Using pd.DataFrame.mask:
lens = df.groupby('name_id')['name'].transform(len)
df['name_id'].mask(lens < 2, -1, inplace=True)
print(df)
name_id name
0 -1 a
1 2 b
2 2 b
3 3 c
4 3 c
5 3 c
I'm trying to perform a specific operation on a dataframe.
Given the following dataframe:
df1 = pd.DataFrame({
    'id': [0, 1, 2, 1, 3, 0],
    'letter': ['a', 'b', 'c', 'b', 'b', 'a'],
    'status': [0, 1, 0, 0, 0, 1]})
id letter status
0 a 0
1 b 1
2 c 0
1 b 0
3 b 0
0 a 1
I'd like to create another dataframe that contains rows from df1, subject to the following restriction:
If 2 or more rows have the same id and letter, keep only the row whose status is 1. All other rows must be copied over unchanged.
The resulting dataframe should look like this:
id letter status
0 a 1
1 b 1
2 c 0
3 b 0
Any help is greatly appreciated. Thank you
This should work:
>>> fn = lambda obj: obj[obj.status == 1] if any(obj.status == 1) else obj
>>> df1.groupby(['id', 'letter'], as_index=False).apply(fn)
id letter status
5 0 a 1
1 1 b 1
2 2 c 0
4 3 b 0
[4 rows x 3 columns]
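If you want a clean 0..n index on the result, you can reset it afterwards (a cosmetic step, not part of the original answer):
out = df1.groupby(['id', 'letter'], as_index=False).apply(fn).reset_index(drop=True)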
Sort by 'status' first and then use groupby:
In [1932]: df1.sort_values(by='status').groupby('id', as_index=False).last()
Out[1932]:
id letter status
0 0 a 1
1 1 b 1
2 2 c 0
3 3 b 0
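Note that grouping by 'id' alone works here only because each id maps to a single letter in this data; if the same id could occur with different letters, grouping on both keys would be safer (a sketch):
df1.sort_values(by='status').groupby(['id', 'letter'], as_index=False).last()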