pandas named aggregation without multilevel dataframe - python

I am trying to remove the MultiIndex level from the columns, but I have been unable to do so.
import pandas as pd
k = pd.DataFrame([['x',2], ['y',4],['x',6]], columns=['name','value'])
agg_item={'value': [('n', 'count')]}
k=k[['name','value']].groupby(['name'],dropna=False).agg(agg_item).reset_index()
k
  name value
           n
0    x     2
1    y     1
k.columns
MultiIndex([( 'name',  ''),
            ('value', 'n')],
           )
How do I get a SQL-like table with only the 'name' and 'n' columns?
Desired output:
name n
0 x 2
1 y 1

You can use a named aggregation with pd.NamedAgg to avoid creating a MultiIndex in the first place:
n_agg = pd.NamedAgg(column='value', aggfunc='count')
k = k[['name','value']].groupby(['name'],dropna=False).agg(n=n_agg).reset_index()
Output:
>>> k
name n
0 x 2
1 y 1
Or, as @itthrill suggested, you can use .agg(n=('value', 'count')) instead of pd.NamedAgg.
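As a minimal sketch of the same idea, several named aggregations can be requested in one call, and each keyword becomes a flat column name. The total column is an illustrative addition and is not part of the question.

```python
import pandas as pd

k = pd.DataFrame([['x', 2], ['y', 4], ['x', 6]], columns=['name', 'value'])

# Each keyword is an output column; each tuple is (input column, aggregation).
# 'total' is a hypothetical extra column added only for illustration.
out = (
    k.groupby('name', dropna=False)
     .agg(n=('value', 'count'), total=('value', 'sum'))
     .reset_index()
)
print(out)
```

Because the output names come from the keywords, no column level has to be dropped afterwards.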

By passing a list as the dictionary value, you are asking pandas for MultiIndex columns.
You should use this syntax instead:
agg_item={'n': ('value', 'count')}
(k[['name','value']]
 .groupby(['name'], dropna=False)
 .agg(**agg_item)
 .reset_index()
)
NB. Don't forget to unpack the dictionary into keyword arguments with **.
Or without dictionary:
(k[['name','value']]
 .groupby(['name'], dropna=False)
 .agg(n=('value', 'count'))
 .reset_index()
)
Output:
name n
0 x 2
1 y 1
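The difference between the two spellings can be checked directly. The sketch below, which reuses the question's data, compares the number of column levels each call produces; the variable names nested and flat are only illustrative.

```python
import pandas as pd

k = pd.DataFrame([['x', 2], ['y', 4], ['x', 6]], columns=['name', 'value'])
grouped = k.groupby('name', dropna=False)

# A list as the dictionary value produces two-level (MultiIndex) columns.
nested = grouped.agg({'value': [('n', 'count')]})

# Unpacking a {output_name: (column, func)} dictionary produces flat columns.
flat = grouped.agg(**{'n': ('value', 'count')})

print(nested.columns.nlevels)
print(flat.columns.nlevels)
```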

You can use a list comprehension to select levels:
k.columns = [col[0] if col[1]=='' else col[1] for col in k.columns]
You can also use or instead of if-else:
k.columns = [col[1] or col[0] for col in k.columns]
Or you can droplevel before reset_index in your groupby:
k=k[['name','value']].groupby(['name'],dropna=False).agg(agg_item).droplevel(0, axis=1).reset_index()
# droplevel(0, axis=1) is applied here, before reset_index()
Output:
name n
0 x 2
1 y 1
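If the MultiIndex has already been created, the flattening step can be wrapped in a small function. This is only a sketch: flatten_columns is a hypothetical helper name, and it assumes every column label is a tuple of strings in which unused levels are empty strings, as in the question.

```python
import pandas as pd

k = pd.DataFrame([['x', 2], ['y', 4], ['x', 6]], columns=['name', 'value'])
agg_item = {'value': [('n', 'count')]}
res = k.groupby('name', dropna=False).agg(agg_item).reset_index()

def flatten_columns(columns):
    # Hypothetical helper: keep the last non-empty level of each column tuple.
    return [next((part for part in reversed(col) if part != ''), '')
            for col in columns]

res.columns = flatten_columns(res.columns)
print(res)
```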

Related

Groupby count of values - pandas

I'm hoping to count specific values from a pandas df. Using the code below, I'm subsetting Item by Up and grouping by Num and Label to count the values in Item. The values in the output are correct, but I want to drop Label and include Up in the column headers.
import pandas as pd
df = pd.DataFrame({
'Num' : [1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],
'Label' : ['A','B','A','B','B','B','A','B','B','A','A','A','B','A','B','A'],
'Item' : ['Up','Left','Up','Left','Down','Right','Up','Down','Right','Down','Right','Up','Up','Right','Down','Left'],
})
df1 = (df[df['Item'] == 'Up']
.groupby(['Num','Label'])['Item']
.count()
.unstack(fill_value = 0)
.reset_index()
)
Intended output:
Num A_Up B_Up
1 3 0
2 1 1
With your approach, you can include the Item in the grouper.
out = (df[df['Item'] == 'Up'].groupby(['Num','Label','Item']).size()
.unstack(['Label','Item'],fill_value=0))
out.columns=out.columns.map('_'.join)
print(out)
A_Up B_Up
Num
1 3 0
2 1 1
You can use Groupby.transform to have all column names. Then use df.pivot_table and a list comprehension to get your desired column names.
In [2301]: x = df[df['Item'] == 'Up'].copy()  # copy so the new column is not set on a slice of df
In [2304]: x['c'] = x.groupby(['Num','Label'])['Item'].transform('count')
In [2310]: x = x.pivot_table(index='Num', columns=['Label', 'Item'], aggfunc='first', fill_value=0)
In [2313]: x.columns = [j+'_'+k for i,j,k in x.columns]
In [2314]: x
Out[2314]:
A_Up B_Up
Num
1 3 0
2 1 1
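For reference, here is a self-contained sketch of the grouper approach that also turns Num back into a regular column, which matches the layout of the intended output. It uses the data from the question; the final reset_index call is the only addition.

```python
import pandas as pd

df = pd.DataFrame({
    'Num': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2],
    'Label': ['A', 'B', 'A', 'B', 'B', 'B', 'A', 'B',
              'B', 'A', 'A', 'A', 'B', 'A', 'B', 'A'],
    'Item': ['Up', 'Left', 'Up', 'Left', 'Down', 'Right', 'Up', 'Down',
             'Right', 'Down', 'Right', 'Up', 'Up', 'Right', 'Down', 'Left'],
})

out = (
    df[df['Item'] == 'Up']
    .groupby(['Num', 'Label', 'Item'])
    .size()                                      # count rows per (Num, Label, Item)
    .unstack(['Label', 'Item'], fill_value=0)    # move Label and Item into the columns
)
out.columns = out.columns.map('_'.join)          # ('A', 'Up') becomes 'A_Up'
out = out.reset_index()                          # make Num an ordinary column
print(out)
```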

How to get new pandas dataframe with certain columns and rows depending on list elements?

I have such a list:
l = ['A','B']
And such a dataframe df
Name x y
A 1 2
B 2 1
C 2 2
I now want to get a new dataframe where only the entries for Name and x which are included in l are kept.
new_df should look like this:
Name x
A 1
B 2
I was playing around with isin but did not solve this problem.
Use DataFrame.loc with Series.isin:
new_df = df.loc[df.Name.isin(l), ["Name", "x"]]
This should do it:
# assuming Name is the index
new_df = df[df.index.isin(l)]
# if you only want column x
new_df = df.loc[df.index.isin(l), "x"]
It is as simple as that:
l = ['A','B']
def make_empty(row):
    # keep values that appear in l and blank out everything else
    for idx, value in enumerate(row):
        row.iloc[idx] = value if value in l else ''
    return row
df_new = df[df['Name'].isin(l) | df['x'].isin(l)][['Name','x']]
df_new = df_new.apply(make_empty, axis=1)
Output:
Name x
0 A
1 B
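A self-contained sketch of the loc plus isin approach, assuming Name is an ordinary column (not the index) and using the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'C'], 'x': [1, 2, 2], 'y': [2, 1, 2]})
l = ['A', 'B']

# Row selector: rows whose Name is in l. Column selector: only Name and x.
new_df = df.loc[df['Name'].isin(l), ['Name', 'x']]
print(new_df)
```

Selecting rows and columns in a single loc call avoids chained indexing.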

How can I remove string after last underscore in python dataframe?

I want to remove everything after the last underscore from the strings in a dataframe. Suppose the data in my dataframe looks like this:
AA_XX,
AAA_BB_XX,
AA_BB_XYX,
AA_A_B_YXX
I would like to get this result
AA,
AAA_BB,
AA_BB,
AA_A_B
You can do this simply using Series.str.split and Series.str.join:
In [2381]: df
Out[2381]:
col1
0 AA_XX
1 AAA_BB_XX
2 AA_BB_XYX
3 AA_A_B_YXX
In [2386]: df['col1'] = df['col1'].str.split('_').str[:-1].str.join('_')
In [2387]: df
Out[2387]:
col1
0 AA
1 AAA_BB
2 AA_BB
3 AA_A_B
pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX']})['col'].apply(lambda r: '_'.join(r.split('_')[:-1]))
Explanation:
df = pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX']})
Creates
col
0 AA_XX
1 AAA_BB_XX
2 AA_BB_XYX
3 AA_A_B_YXX
Use apply to loop through the column you want to edit.
The lambda splits each string at _ and then rejoins all parts except the last one with _:
df['col'] = df['col'].apply(lambda r: '_'.join(r.split('_')[:-1]))
print(df)
Results:
col
0 AA
1 AAA_BB
2 AA_BB
3 AA_A_B
If your dataset contains values like AA (values without an underscore), change the lambda like this:
df = pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX', 'AA']})
df['col'] = df['col'].apply(lambda r: '_'.join(r.split('_')[:-1]) if len(r.split('_')) > 1 else r)
print(df)
Here is another way of going about it.
import pandas as pd
data = {'s': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX']}
df = pd.DataFrame(data)
def cond1(s):
    # split on underscores and drop the last piece, if there is one
    temp_s = s.split('_')
    if len(temp_s) == 1:
        return s  # no underscore: keep the value unchanged
    else:
        return '_'.join(temp_s[:-1])  # rejoin so the result is a string, not a list
df['result'] = df['s'].apply(cond1)
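A vectorized alternative, offered as a sketch rather than as any of the answerers' methods, is Series.str.rsplit with n=1. It splits once from the right, so values without an underscore come back unchanged. The extra 'AA' row is included to show that case.

```python
import pandas as pd

df = pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX', 'AA']})

# Split at most once, starting from the right, and keep the left-hand piece.
df['result'] = df['col'].str.rsplit('_', n=1).str[0]
print(df)
```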

Python DataFrame : Split data in rows based on custom value?

I have a dataframe with column a. I need to get the data after the second underscore (_).
a
0 abc_def12_0520_123
1 def_ghij123_0120_456
raw_data = {'a': ['abc_def12_0520_123', 'def_ghij123_0120_456']}
df = pd.DataFrame(raw_data, columns = ['a'])
Output:
a b
0 abc_def12_0520_123 0520_123
1 def_ghij123_0120_456 0120_456
What I have tried:
df['b'] = df.number.str.replace('\D+', '')
I tried removing the letters first, but it is getting complex. Any suggestions?
Here is how:
df['b'] = ['_'.join(s.split('_')[2:]) for s in df['a']]
print(df)
Output:
a b
0 abc_def12_0520_123 0520_123
1 def_ghij123_0120_456 0120_456
Explanation:
lst = ['_'.join(s.split('_')[2:]) for s in df['a']]
is the equivalent of:
lst = []
for s in df['a']:
    a = s.split('_')[2:]  # all substrings split on '_' except the first two
    lst.append('_'.join(a))
Try (in recent pandas versions n must be passed as a keyword):
df['b'] = df['a'].str.split('_', n=2).str[-1]
a b
0 abc_def12_0520_123 0520_123
1 def_ghij123_0120_456 0120_456
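One caveat with taking the last element is that a value with fewer than two underscores would return its last piece instead of nothing. The sketch below uses .str.get(2), which yields NaN when the third piece does not exist. The third row, 'short_value', is a hypothetical value added only to illustrate that case.

```python
import pandas as pd

# The last row is a made-up value with only one underscore.
df = pd.DataFrame({'a': ['abc_def12_0520_123', 'def_ghij123_0120_456', 'short_value']})

# Split at most twice from the left; the third piece is everything after the second '_'.
df['b'] = df['a'].str.split('_', n=2).str.get(2)
print(df)
```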

Find column names when row element meets a criteria Pandas

This is a basic question. I've got a square array with the rows and columns summed up. Eg:
df = pd.DataFrame([[0,0,1,0], [0,0,1,0], [1,0,0,0], [0,1,0,0]], index = ["a","b","c","d"], columns = ["a","b","c","d"])
df["sumRows"] = df.sum(axis = 1)
df.loc["sumCols"] = df.sum()
This returns:
In [100]: df
Out[100]:
a b c d sumRows
a 0 0 1 0 1
b 0 0 1 0 1
c 1 0 0 0 1
d 0 1 0 0 1
sumCols 1 1 2 0 4
I need to find the column labels where the sumCols row equals 0. At the moment I am doing this:
[df.loc["sumCols"] == 0].index
But this returns a strange index-type object. All I want is a list of the labels that match this criterion, i.e. ['d'] in this case.
There are two ways (the Index object can be converted to an iterable such as a list).
Do that with the columns:
columns = df.columns[df.sum()==0]
columns = list(columns)
Or you can rotate the Dataframe and treat columns as rows:
list(df.T[df.T.sumCols == 0].index)
You can use a lambda expression to filter series and if you want a list instead of index as result, you can call .tolist() on the index object:
(df.loc['sumCols'] == 0)[lambda x: x].index.tolist()
# ['d']
Or:
df.loc['sumCols'][lambda x: x == 0].index.tolist()
# ['d']
Without explicitly creating the sumCols and if you want to check which column has sum of zero, you can do:
df.sum()[lambda x: x == 0].index.tolist()
# ['d']
Check rows:
df.sum(axis = 1)[lambda x: x == 0].index.tolist()
# []
Note: The lambda is called once on the whole Series and returns a boolean mask, so this approach is as fast as ordinary vectorized subsetting. It is functional in style and fits easily in a one-liner if you prefer.
Here's a simple method using query after transposing:
df.T.query('sumCols == 0').index.tolist()
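For completeness, a self-contained sketch using a plain boolean mask on the column and row sums, without adding the sumRows or sumCols helpers to the frame. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 0, 1, 0], [0, 0, 1, 0], [1, 0, 0, 0], [0, 1, 0, 0]],
    index=['a', 'b', 'c', 'd'],
    columns=['a', 'b', 'c', 'd'],
)

col_sums = df.sum()          # one total per column
row_sums = df.sum(axis=1)    # one total per row

# Filter each Series with a boolean mask, then turn the surviving labels into a list.
zero_cols = col_sums[col_sums == 0].index.tolist()
zero_rows = row_sums[row_sums == 0].index.tolist()
print(zero_cols, zero_rows)
```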
