Drop level for index - python

I have the result below from a pivot table that counts the customer grades that visited my stores. I used the 'droplevel' method to flatten the column header into one layer. How can I do the same for the index? I want to remove 'Grade' above the index, so that the column headers are at the same level as 'Store No_'.

It seems you need to remove the name of the columns:
df.columns.name = None
Or rename_axis:
df = df.rename_axis(None, axis=1)
Sample:
df = pd.DataFrame({'Store No_':[1,2,3],
'A':[4,5,6],
'B':[7,8,9],
'C':[1,3,5],
'D':[5,3,6],
'E':[7,4,3]})
df = df.set_index('Store No_')
df.columns.name = 'Grade'
print (df)
Grade A B C D E
Store No_
1 4 7 1 5 7
2 5 8 3 3 4
3 6 9 5 6 3
print (df.rename_axis(None, axis=1))
A B C D E
Store No_
1 4 7 1 5 7
2 5 8 3 3 4
3 6 9 5 6 3
df = df.rename_axis(None, axis=1).reset_index()
print (df)
Store No_ A B C D E
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
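For context, here is a minimal sketch of where the 'Grade' label comes from. It assumes the original result was built with pivot_table; the visits frame and its 'Customer' column are hypothetical, since the question does not show the source data. The label is the name of the columns Index, not an extra index level, which is why droplevel does not apply.

```python
import pandas as pd

# Hypothetical raw data: one row per customer visit
visits = pd.DataFrame({
    'Store No_': [1, 1, 2, 2, 3],
    'Grade': ['A', 'B', 'A', 'A', 'B'],
    'Customer': [10, 11, 12, 13, 14],
})

# Count customers per store and grade
counts = visits.pivot_table(index='Store No_', columns='Grade',
                            values='Customer', aggfunc='count', fill_value=0)

# 'Grade' is stored as the name of the columns Index
print(counts.columns.name)

# Drop that name, then turn 'Store No_' into a regular column
flat = counts.rename_axis(None, axis=1).reset_index()
print(flat)
```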

Related

How to pivot one column into multiple columns in a dataframe?

I have a dataframe of type:
a = ['a','b','c','a','b','c','a','b','c']
b = [0,1,2,3,4,5,6,7,8]
df = pd.DataFrame({'key':a,'values':b})
key values
0 a 0
1 b 1
2 c 2
3 a 3
4 b 4
5 c 5
6 a 6
7 b 7
8 c 8
I want to move the values in the "values" column to new columns where they have the same "key".
So result:
key values0 values1 values2
0 a 0 3 6
1 b 1 4 7
2 c 2 5 8
From this question How can I pivot a dataframe?
I've tried:
a=df.pivot_table(index='key',values='values',aggfunc=list).squeeze()
pd.DataFrame(a.tolist(),index=a.index)
Which gives
0 1 2
key
a 0 3 6
b 1 4 7
c 2 5 8
But I don't want the index to be 'key', I want the index to stay the same.
You can use reset_index.
a = df.pivot_table(index='key',values='values',aggfunc=list).squeeze()
out = pd.DataFrame(a.tolist(),index=a.index).add_prefix('values').reset_index()
print(out)
# Output
key values0 values1 values2
0 a 0 3 6
1 b 1 4 7
2 c 2 5 8
Another way to do it:
out = (df.pivot_table('values', 'key', df.index // 3)
.add_prefix('values').reset_index())
print(out)
# Output
key values0 values1 values2
0 a 0 3 6
1 b 1 4 7
2 c 2 5 8
Or number the rows within each key with groupby + cumcount, then pivot:
df["id"] = df.groupby("key").cumcount()
df.pivot(columns="id", index="key").reset_index()
# key values
# id 0 1 2
# 0 a 0 3 6
# 1 b 1 4 7
# 2 c 2 5 8
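The cumcount approach leaves two header rows. A possible variant, sketched here under the assumption that you want the flat values0, values1, values2 header from the question, passes values= to pivot and then cleans up the column labels:

```python
import pandas as pd

a = ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c']
b = [0, 1, 2, 3, 4, 5, 6, 7, 8]
df = pd.DataFrame({'key': a, 'values': b})

out = (df.assign(id=df.groupby('key').cumcount())        # position within each key
         .pivot(index='key', columns='id', values='values')
         .add_prefix('values')                           # 0, 1, 2 -> values0, values1, values2
         .rename_axis(None, axis=1)                      # drop the 'id' columns name
         .reset_index())
print(out)
```

Unlike df.index // 3, this does not rely on the rows arriving in equally sized, ordered blocks.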

Print out pandas groupby without operation

So I have the following pandas dataframe:
import pandas as pd
sample_df = pd.DataFrame({'note': ['D','C','D','C'], 'time': [1,1,4,6], 'val': [6,4,7,9]})
which gives the result
note time val
0 D 1 6
1 C 1 4
2 D 4 7
3 C 6 9
What I want is
note index time val
C 1 1 4
3 6 9
D 0 1 6
2 4 7
I tried sample_df.set_index('note',append=True) and it didn't work.
Add DataFrame.swaplevel, then DataFrame.sort_index on the first level:
df = sample_df.set_index('note', append=True).swaplevel(1,0).sort_index(level=0)
print (df)
time val
note
C 1 1 4
3 6 9
D 0 1 6
2 4 7
If you need to set the level name, add DataFrame.rename_axis:
df = (sample_df.rename_axis('idx')
.set_index('note',append=True)
.swaplevel(1,0)
.sort_index(level=0))
print (df)
time val
note idx
C 1 1 4
3 6 9
D 0 1 6
2 4 7
Alternatively:
sample_df.index.rename('old_index', inplace=True)
sample_df.reset_index(inplace=True)
sample_df.set_index(['note','old_index'], inplace=True)
sample_df.sort_index(level=0, inplace=True)
print (sample_df)
time val
note old_index
C 1 1 4
3 6 9
D 0 1 6
2 4 7
I use MultiIndex.from_arrays to create the target index:
sample_df.index=pd.MultiIndex.from_arrays([sample_df.note,sample_df.index])
sample_df.drop(columns='note', inplace=True)
sample_df=sample_df.sort_index(level=0)
sample_df
time val
note
C 1 1 4
3 6 9
D 0 1 6
2 4 7
I would use set_index and pop to simultaneously discard column 'note' and set new index
df.set_index([df.pop('note'), df.index]).sort_index(level=0)
Out[380]:
time val
note
C 1 1 4
3 6 9
D 0 1 6
2 4 7
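As a rough check of why the original attempt "didn't work", the sketch below shows that set_index with append=True puts 'note' after the existing index level, and that two of the approaches above end up with the same index. The variable names appended, via_swap and via_arrays are only for illustration.

```python
import pandas as pd

sample_df = pd.DataFrame({'note': ['D', 'C', 'D', 'C'],
                          'time': [1, 1, 4, 6],
                          'val': [6, 4, 7, 9]})

# append=True adds 'note' as the second level, behind the original index
appended = sample_df.set_index('note', append=True)
print(appended.index.names)

# Approach 1: swap the two levels, then sort by the first one
via_swap = appended.swaplevel(0, 1).sort_index(level=0)

# Approach 2: build the index in the desired order directly
via_arrays = (sample_df.drop(columns='note')
                       .set_index([sample_df['note'], sample_df.index])
                       .sort_index(level=0))

print(via_swap)
```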

How to update the last column value in all the rows in csv file using python(pandas)

I am trying to update the last column value for all the rows in a CSV file using pandas, but while updating the value, the other column values are dropped.
file = r'Test.csv'
# Read the file
df = pd.read_csv(file, error_bad_lines=False)
# df.at[3, "ingestion"] = '20'
df.set_value(1, "ingestion", '30')
df.to_csv("Test.csv", index=False, sep='|')
Use DataFrame.iloc with -1 to select the last column and : to select all rows:
df = pd.DataFrame({'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')})
print (df)
A B C D E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
df.iloc[:, -1] = '20'
print (df)
A B C D E F
0 a 4 7 1 5 20
1 b 5 8 3 3 20
2 c 4 9 5 6 20
3 d 5 4 7 9 20
4 e 5 2 1 2 20
5 f 4 3 0 4 20
EDIT:
To update every column of the last row instead, swap -1 with : and read the bottom-right value with DataFrame.iat:
df.iloc[-1, :] = df.iat[-1, -1]
print (df)
A B C D E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 b b b b b b
pd.DataFrame.set_value is not appropriate for setting all the values in a column. As per the docs, it is used to set a scalar at a specific row and column label combination.
Moreover, it was deprecated in v0.21 in favour of the .at / .iat accessors and removed in pandas 1.0.
Instead, you can set the value directly by extracting the final column label, assuming you have no duplicate column names:
df[df.columns[-1]] = '20'
Or, more directly, you can use the iloc accessor:
df.iloc[:, -1] = '20'
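Putting it together for a file, here is a minimal sketch of the read, update, write round trip. It uses an in-memory buffer instead of Test.csv, and the column names and the '|' separator of the input are assumptions, since the question does not show the file. The point is to pass the same sep to read_csv and to_csv so every column survives the round trip.

```python
import io
import pandas as pd

# Hypothetical file content; the real Test.csv is not shown in the question
raw = "id|name|ingestion\n1|a|10\n2|b|10\n"

# Read with the separator the file actually uses
df = pd.read_csv(io.StringIO(raw), sep='|')

# Replace the whole last column, whatever its label is
df[df.columns[-1]] = '20'

# Write back with the same separator
buf = io.StringIO()
df.to_csv(buf, index=False, sep='|')
result = buf.getvalue()
print(result)
```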

How to find the top 5 values of a column according to another column?

I am practising on the IMDB dataset and I would like to find the top genres that had the largest budgets.
This would be useful in situations where a boxplot is needed and the genres are numerous. Reducing them to the most expensive ones would make the boxplot clearer.
I tried this: df.sort_values(by=["genres","budget"])
but it isn't right.
If you need to return all columns:
I think you need sort_values + groupby + head:
df=df.sort_values(by=["genres","budget"], ascending=[True, False]).groupby("genres").head(5)
Or nlargest:
df = df.groupby('genres', group_keys=False).apply(lambda x: x.nlargest(5, "budget"))
If you need to return only the genres and budget columns:
df = df.groupby('genres')["budget"].nlargest(5).reset_index(level=1, drop=True).reset_index()
Samples:
df = pd.DataFrame({'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'budget':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'genres':list('aaabbb')})
print (df)
A B C E budget genres
0 a 4 7 5 1 a
1 b 5 8 3 3 a
2 c 4 9 6 5 a
3 d 5 4 9 7 b
4 e 5 2 2 1 b
5 f 4 3 4 0 b
df1=df.sort_values(by=["genres","budget"], ascending=[True, False]).groupby("genres").head(2)
df1 = df.groupby('genres', group_keys=False).apply(lambda x: x.nlargest(2, "budget"))
print (df1)
A B C E budget genres
2 c 4 9 6 5 a
1 b 5 8 3 3 a
3 d 5 4 9 7 b
4 e 5 2 2 1 b
df1=df.groupby('genres')["budget"].nlargest(2).reset_index(level=1, drop=True).reset_index()
print (df1)
genres budget
0 a 5
1 a 3
2 b 7
3 b 1
---
If you need the top genres by the sum of budget per genre:
df = pd.DataFrame({'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'budget':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'genres':list('aabbcc')})
print (df)
A B C E budget genres
0 a 4 7 5 1 a
1 b 5 8 3 3 a
2 c 4 9 6 5 b
3 d 5 4 9 7 b
4 e 5 2 2 1 c
5 f 4 3 4 0 c
top = df.groupby('genres')['budget'].sum().nlargest(2)
print (top)
genres
b 12
a 4
Name: budget, dtype: int64
Detail:
print (df.groupby('genres')['budget'].sum())
genres
a 4
b 12
c 1
Name: budget, dtype: int64
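For the boxplot use case from the question, one possible follow-up (a sketch, not part of the answer above) is to keep only the rows of the original frame whose genre is among the top ones. The names top_genres and subset are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'genres': list('aabbcc'),
                   'budget': [1, 3, 5, 7, 1, 0]})

# Labels of the 2 genres with the largest total budget
top_genres = df.groupby('genres')['budget'].sum().nlargest(2).index

# Keep only the rows that belong to those genres
subset = df[df['genres'].isin(top_genres)]
print(subset)
```

The subset frame can then be passed to the plotting call in place of the full frame.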

What is the equivalent of a SQL count in Pandas

In SQL, select a.*, count(a.id) as N from table a group by a.name would give me a new column 'N' containing the count as per my group by specification.
However in pandas, if I try df['name'].value_counts(), I get the count but not as a column in the original dataframe.
Is there a way to get the count as a column in the original dataframe in a single step/statement?
It seems you need groupby + transform with the size function:
df = pd.DataFrame({'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'name':list('aaabcc')})
print (df)
A B C D E name
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 c
5 f 4 3 0 4 c
df['new'] = df.groupby('name')['name'].transform('size')
print (df)
A B C D E name new
0 a 4 7 1 5 a 3
1 b 5 8 3 3 a 3
2 c 4 9 5 6 a 3
3 d 5 4 7 9 b 1
4 e 5 2 1 2 c 2
5 f 4 3 0 4 c 2
What is the difference between size and count in pandas?
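In short, size counts all rows in each group, while count skips missing values. A small sketch with a hypothetical score column containing a NaN shows the difference:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['a', 'a', 'b'],
                   'score': [1.0, np.nan, 3.0]})

# size: number of rows per group, NaN included
df['n_rows'] = df.groupby('name')['score'].transform('size')

# count: number of non-missing values per group
df['n_valid'] = df.groupby('name')['score'].transform('count')
print(df)
```

If the SQL query counts a column that can be NULL, count is the closer match; for a plain row count, use size.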
