In SQL, select a.*, count(a.id) as N from table a group by a.name would give me a new column 'N' containing the count as per my group by specification.
However in pandas, if I try df['name'].value_counts(), I get the count but not as a column in the original dataframe.
Is there a way to get the count as a column in the original dataframe in a single step/statement?
You need groupby combined with transform and the size function:
import pandas as pd
df = pd.DataFrame({'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'name':list('aaabcc')})
print (df)
A B C D E name
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 c
5 f 4 3 0 4 c
df['new'] = df.groupby('name')['name'].transform('size')
print (df)
A B C D E name new
0 a 4 7 1 5 a 3
1 b 5 8 3 3 a 3
2 c 4 9 5 6 a 3
3 d 5 4 7 9 b 1
4 e 5 2 1 2 c 2
5 f 4 3 0 4 c 2
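Since the question started from value_counts(), the same column can also be built by mapping those counts back onto the rows. This is a minimal sketch on a reduced version of the sample data (only the name and B columns are kept, which is an assumption made for brevity):

```python
import pandas as pd

df = pd.DataFrame({'name': list('aaabcc'), 'B': [4, 5, 4, 5, 5, 4]})

# value_counts() returns a Series indexed by the distinct names;
# map() looks up each row's name in that Series.
df['new'] = df['name'].map(df['name'].value_counts())
print(df)
```

Both approaches are single statements; transform('size') avoids the intermediate lookup Series, while map keeps the value_counts() idiom from the question.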
What is the difference between size and count in pandas?
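In short, size counts every row in a group, while count only counts non-NaN values. The sketch below illustrates this on a small made-up frame (the column names val, n_rows and n_valid are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['a', 'a', 'b'],
                   'val': [1.0, np.nan, 3.0]})

g = df.groupby('name')['val']
df['n_rows'] = g.transform('size')    # rows per group, NaN included
df['n_valid'] = g.transform('count')  # non-NaN values per group
print(df)
```

For group a, n_rows is 2 but n_valid is 1, because one of its two values is NaN.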
I would like to obtain the 'Value' column below, from the original df:
A B C Column_To_Use
0 2 3 4 A
1 5 6 7 C
2 8 0 9 B
A B C Column_To_Use Value
0 2 3 4 A 2
1 5 6 7 C 7
2 8 0 9 B 0
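Use DataFrame.lookup (note that it was deprecated in pandas 1.2.0 and removed in 2.0.0, so this works only on older versions):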
df['Value'] = df.lookup(df.index, df['Column_To_Use'])
print (df)
A B C Column_To_Use Value
0 2 3 4 A 2
1 5 6 7 C 7
2 8 0 9 B 0
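On pandas versions without DataFrame.lookup, a similar result can be obtained with factorize plus NumPy integer indexing. This is a sketch of that pattern, with the sample frame rebuilt from the question; the helper names idx and cols are only illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [2, 5, 8],
                   'B': [3, 6, 0],
                   'C': [4, 7, 9],
                   'Column_To_Use': ['A', 'C', 'B']})

# idx: integer code per row; cols: the distinct column labels those codes refer to
idx, cols = pd.factorize(df['Column_To_Use'])

# Reorder the columns to match cols, then pick one cell per row by position
df['Value'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
print(df)
```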
Let's say I have the following series:
0 A
1 B
2 C
dtype: object
0 1
1 2
2 3
3 4
dtype: int64
How can I merge them to create an empty dataframe with every possible combination of values, like this:
letter number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 1
5 B 2
6 B 3
7 B 4
8 C 1
9 C 2
10 C 3
11 C 4
Assuming the two series are s and s1, use itertools.product(), which gives the Cartesian product of the input iterables:
import itertools
import pandas as pd
s = pd.Series(['A', 'B', 'C'])
s1 = pd.Series([1, 2, 3, 4])
df = pd.DataFrame(list(itertools.product(s, s1)), columns=['letter', 'number'])
print(df)
letter number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 1
5 B 2
6 B 3
7 B 4
8 C 1
9 C 2
10 C 3
11 C 4
As of Pandas 1.2.0, there is a how='cross' option in pandas.merge() that produces the Cartesian product of the columns.
import pandas as pd
letters = pd.DataFrame({'letter': ['A','B','C']})
numbers = pd.DataFrame({'number': [1,2,3,4]})
together = pd.merge(letters, numbers, how='cross')
print(together)
letter number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 1
5 B 2
6 B 3
7 B 4
8 C 1
9 C 2
10 C 3
11 C 4
As an additional bonus, this function makes it easy to do so with more than one column.
letters = pd.DataFrame({'letterA': ['A','B','C'],
'letterB': ['D','D','E']})
numbers = pd.DataFrame({'number': [1,2,3,4]})
together = pd.merge(letters, numbers, how='cross')
print(together)
letterA letterB number
0 A D 1
1 A D 2
2 A D 3
3 A D 4
4 B D 1
5 B D 2
6 B D 3
7 B D 4
8 C E 1
9 C E 2
10 C E 3
11 C E 4
If you have two Series, s1 and s2, you can do this (the final column selection assumes the Series are named "s1" and "s2"):
pd.DataFrame(index=s1,columns=s2).unstack().reset_index()[["s1","s2"]]
This gives the following:
s1 s2
0 A 1
1 B 1
2 C 1
3 A 2
4 B 2
5 C 2
6 A 3
7 B 3
8 C 3
9 A 4
10 B 4
11 C 4
You can use pandas.MultiIndex.from_product():
import pandas as pd
pd.DataFrame(
index = pd.MultiIndex
.from_product(
[
['A', 'B', 'C'],
[1, 2, 3, 4]
],
names = ['letters', 'numbers']
)
)
which results in a hierarchical structure:
letters numbers
A 1
2
3
4
B 1
2
3
4
C 1
2
3
4
and you can further call .reset_index() to get ungrouped results:
letters numbers
0 A 1
1 A 2
2 A 3
3 A 4
4 B 1
5 B 2
6 B 3
7 B 4
8 C 1
9 C 2
10 C 3
11 C 4
(However, I find @NickCHK's answer to be the best.)
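The MultiIndex route also works directly on the two Series from the question. The sketch below assumes they are called s and s1, and uses MultiIndex.to_frame(index=False) as a shortcut for building an empty frame and resetting its index:

```python
import pandas as pd

s = pd.Series(['A', 'B', 'C'])
s1 = pd.Series([1, 2, 3, 4])

# Cartesian product of the two Series as a two-level index
idx = pd.MultiIndex.from_product([s, s1], names=['letter', 'number'])

# Turn the index levels into ordinary columns with a default 0..n-1 index
df = idx.to_frame(index=False)
print(df)
```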
How can I drop columns with more than 50 distinct values using a function?
Here the columns to drop are: date_dispatch, con_birth_dt, dat_cust_open, cust_mgr_team, mng_issu_date, created_date
app_train.select_dtypes('object').apply(pd.Series.nunique, axis = 0)
label 1
date_dispatch 2883
con_birth_dt 12617
con_sex_mf 2
dat_cust_open 264
cust_mgr_team 2250
mng_issu_date 1796
um_num 38
created_date 2900
hqck_flag 2
dqck_flag 2
tzck_flag 2
yhlcck_flag 2
bzjck_flag 2
gzck_flag 2
jjsz_flag 2
e_yhlcck_flag 2
zq_flag 2
xtsz_flag 1
whsz_flag 1
hjsz_flag 2
yb_flag 2
qslc_flag 2
Use drop with index values filtered by boolean indexing:
a = app_train.select_dtypes('object').apply(pd.Series.nunique, axis = 0)
df = app_train.drop(a.index[a > 50], axis=1)
Another solution is to add reindex for the missing (non-object) columns and then filter by the inverted condition <=:
a = (app_train.select_dtypes('object')
.apply(pd.Series.nunique, axis = 0)
.reindex(app_train.columns, fill_value=0))
df = app_train.loc[:, a <= 50]
Sample:
app_train = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')
})
print (app_train)
A B C D E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
a = (app_train.select_dtypes('object')
.apply(pd.Series.nunique, axis = 0)
.reindex(app_train.columns, fill_value=0))
df = app_train.loc[:, a <= 5]
print (df)
B C D E F
0 4 7 1 5 a
1 5 8 3 3 a
2 4 9 5 6 a
3 5 4 7 9 b
4 5 2 1 2 b
5 4 3 0 4 b
nunique + loc
You can use nunique followed by loc with Boolean indexing:
n = 5 # maximum number of unique values permitted
counts = app_train.select_dtypes(['object']).apply(pd.Series.nunique)
df = app_train.loc[:, ~app_train.columns.isin(counts[counts > n].index)]
# data from jezrael
print(df)
B C D E F
0 4 7 1 5 a
1 5 8 3 3 a
2 4 9 5 6 a
3 5 4 7 9 b
4 5 2 1 2 b
5 4 3 0 4 b
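Since the question asks for a function, the logic above can be wrapped in one. This is only a sketch: the name drop_high_cardinality and its max_unique parameter are hypothetical, and it selects every non-numeric column with exclude='number' rather than only object columns, which is a deliberate variation so that it does not depend on how strings are stored.

```python
import pandas as pd


def drop_high_cardinality(df, max_unique=50):
    """Return df without the non-numeric columns that have more than max_unique distinct values."""
    counts = df.select_dtypes(exclude='number').nunique()
    return df.drop(columns=counts[counts > max_unique].index)


app_train = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabbb'),
})

df = drop_high_cardinality(app_train, max_unique=5)
print(df)
```

Column A has 6 distinct values and is dropped; column F has 2 and is kept, as in the sample output above.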
I am practising on the IMDB dataset and I would like to find the top genres with the largest budgets.
This would be useful in situations where a boxplot is needed and the genres are numerous. Restricting them to the most expensive ones would make the boxplot clearer.
I tried this: df.sort_values(by=["genres","budget"])
but it isn't right.
If you need to return all columns:
I think you need sort_values + groupby + head:
df=df.sort_values(by=["genres","budget"], ascending=[True, False]).groupby("genres").head(5)
Or nlargest:
df = df.groupby('genres', group_keys=False).apply(lambda x: x.nlargest(5, "budget"))
If you need to return only the genres and budget columns:
df = df.groupby('genres')["budget"].nlargest(2).reset_index(level=1, drop=True).reset_index()
Samples:
df = pd.DataFrame({'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'budget':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'genres':list('aaabbb')})
print (df)
A B C E budget genres
0 a 4 7 5 1 a
1 b 5 8 3 3 a
2 c 4 9 6 5 a
3 d 5 4 9 7 b
4 e 5 2 2 1 b
5 f 4 3 4 0 b
df1=df.sort_values(by=["genres","budget"], ascending=[True, False]).groupby("genres").head(2)
# equivalent alternative using nlargest (overwrites df1 with the same rows)
df1 = df.groupby('genres', group_keys=False).apply(lambda x: x.nlargest(2, "budget"))
print (df1)
A B C E budget genres
2 c 4 9 6 5 a
1 b 5 8 3 3 a
3 d 5 4 9 7 b
4 e 5 2 2 1 b
df1=df.groupby('genres')["budget"].nlargest(2).reset_index(level=1, drop=True).reset_index()
print (df1)
genres budget
0 a 5
1 a 3
2 b 7
3 b 1
---
If you need the top genres by the sum of budget per genre:
df = pd.DataFrame({'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'budget':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'genres':list('aabbcc')})
print (df)
A B C E budget genres
0 a 4 7 5 1 a
1 b 5 8 3 3 a
2 c 4 9 6 5 b
3 d 5 4 9 7 b
4 e 5 2 2 1 c
5 f 4 3 4 0 c
df = df.groupby('genres')['budget'].sum().nlargest(2)
print (df)
genres
b 12
a 4
Name: budget, dtype: int64
Detail:
print (df.groupby('genres')['budget'].sum())
genres
a 4
b 12
c 1
Name: budget, dtype: int64
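For the boxplot use case from the question, the top genres can then be used to filter the original rows. A minimal sketch on a reduced version of the sample data; the names top and subset are illustrative, and the plotting call is left commented out because it needs matplotlib:

```python
import pandas as pd

df = pd.DataFrame({'budget': [1, 3, 5, 7, 1, 0],
                   'genres': list('aabbcc')})

# Labels of the 2 genres with the largest total budget
top = df.groupby('genres')['budget'].sum().nlargest(2).index

# Keep only the rows belonging to those genres
subset = df[df['genres'].isin(top)]
print(subset)

# subset.boxplot(column='budget', by='genres')  # requires matplotlib
```

Unlike the snippet above, this keeps df intact and stores the filtered rows in a separate variable.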
I am using pandas with Python.
I have a column in which the first value is zero.
There are other zeros in the column as well, but I don't want to delete those.
I want to delete this cell and move the column up by 1 position.
If it is easier, I can turn the first zero into an empty cell and then delete it, but I can't find any way to delete a specific cell and move the rest of the column up.
So far I have looked for help on Stack Overflow, Quora, GitHub, etc., but I can't find what I am looking for.
I believe you need to shift first and then replace the last NaN value:
df = pd.DataFrame({'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')})
print (df)
A B C D E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
If the column has no NaNs, use only fillna to replace the NaN created by the shift:
df['A'] = df['A'].shift(-1).fillna('AAA')
print (df)
A B C D E F
0 b 4 7 1 5 a
1 c 5 8 3 3 a
2 d 4 9 5 6 a
3 e 5 4 7 9 b
4 f 5 2 1 2 b
5 AAA 4 3 0 4 b
If the column may contain NaNs, set the last value with iloc instead; the get_loc function returns the position of column A:
df['A'] = df['A'].shift(-1)
df.iloc[-1, df.columns.get_loc('A')] = 'AAA'
print (df)
A B C D E F
0 b 4 7 1 5 a
1 c 5 8 3 3 a
2 d 4 9 5 6 a
3 e 5 4 7 9 b
4 f 5 2 1 2 b
5 AAA 4 3 0 4 b
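For a numeric column like the one in the question, shift also accepts a fill_value argument, which pads the vacated last cell directly and avoids the temporary NaN (and the conversion to float it would cause). A sketch on made-up data, where -1 is only a hypothetical placeholder:

```python
import pandas as pd

df = pd.DataFrame({'D': [0, 3, 0, 7, 1],
                   'E': [5, 3, 6, 9, 2]})

# Drop the first cell of D, move the rest up by one, pad the end with -1.
# The other zero in the column is untouched.
df['D'] = df['D'].shift(-1, fill_value=-1)
print(df)
```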