Groupby count of values - pandas - python

I'm hoping to count specific values from a pandas df. Using the code below, I'm subsetting Item by 'Up' and grouping Num and Label to count the values in Item. The values in the output are correct, but I want to drop Label and include Up in the column headers.
import pandas as pd

df = pd.DataFrame({
    'Num'   : [1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],
    'Label' : ['A','B','A','B','B','B','A','B','B','A','A','A','B','A','B','A'],
    'Item'  : ['Up','Left','Up','Left','Down','Right','Up','Down','Right','Down','Right','Up','Up','Right','Down','Left'],
})

df1 = (df[df['Item'] == 'Up']
       .groupby(['Num','Label'])['Item']
       .count()
       .unstack(fill_value=0)
       .reset_index()
       )
Intended output:
Num  A_Up  B_Up
  1     3     0
  2     1     1

With your approach, you can include Item in the grouper:
out = (df[df['Item'] == 'Up']
       .groupby(['Num','Label','Item']).size()
       .unstack(['Label','Item'], fill_value=0))
out.columns = out.columns.map('_'.join)
print(out)
     A_Up  B_Up
Num
1       3     0
2       1     1
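The same counts are also available without a groupby: a minimal sketch using pd.crosstab on the filtered frame (same df as above), flattening the columns with map as in the answer above:
up = df[df['Item'] == 'Up']
out = pd.crosstab(up['Num'], [up['Label'], up['Item']])  # counts per (Num, Label, Item)
out.columns = out.columns.map('_'.join)                  # ('A','Up') -> 'A_Up'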

You can use GroupBy.transform to add the counts as a column while keeping all the other columns. Then use df.pivot_table and a list comprehension to get your desired column names.
In [2301]: x = df[df['Item'] == 'Up'].copy()  # .copy() avoids SettingWithCopyWarning on the next line
In [2304]: x['c'] = x.groupby(['Num','Label'])['Item'].transform('count')
In [2310]: x = x.pivot_table(index='Num', columns=['Label', 'Item'], aggfunc='first', fill_value=0)
In [2313]: x.columns = [j+'_'+k for i,j,k in x.columns]
In [2314]: x
Out[2314]:
     A_Up  B_Up
Num
1       3     0
2       1     1

Related

Matching value with column to retrieve index value

Please see the example dataframe below.
I'm trying to match the values of column X with the column names and retrieve the value from the matched column, so that:
A  B  C  X  result
1  2  3  B       2
5  6  7  A       5
8  9  1  C       1
Any ideas?
Here are a couple of methods:
# Apply method:
df['result'] = df.apply(lambda x: df.loc[x.name, x['X']], axis=1)

# List comprehension method:
df['result'] = [df.loc[i, x] for i, x in enumerate(df.X)]

# Pure pandas method:
df['result'] = (df.melt('X', ignore_index=False)
                  .loc[lambda x: x['X'].eq(x['variable']), 'value'])
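If performance matters on larger frames, a NumPy row/column lookup is a vectorized alternative. A minimal sketch, assuming every value in X is an existing column name:
import numpy as np

cols = df.columns.get_indexer(df['X'])                   # column position for each row
df['result'] = df.to_numpy()[np.arange(len(df)), cols]   # pick each row's target cell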
Here I just build a dataframe from your example and call it df:
import pandas as pd

data = {
    'A': (1, 5, 8),
    'B': (2, 6, 9),
    'C': (3, 7, 1),
    'X': ('B', 'A', 'C')}
df = pd.DataFrame(data)
You can extract the value from another column based on 'X' using the following code. There may be a better way to do this without having to convert to a list first and retrieve the first element.
list(df.loc[df['X'] == 'B', 'B'])[0]
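There is: a minimal sketch that takes .iloc[0] on the selection directly, avoiding the list round-trip (same df as above):
df.loc[df['X'] == 'B', 'B'].iloc[0]  # -> 2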
I'm going to create a column called 'result', fill it with 'NA', and then replace the values based on your conditions. The loop below extracts each value and uses .loc to place it in your dataframe.
df['result'] = 'NA'
for idx, val in enumerate(df['X']):
    extracted = list(df.loc[df['X'] == val, val])[0]
    df.loc[idx, 'result'] = extracted
Here it is as a function:
def search_replace(dataframe, search_col='X', new_col_name='result'):
    dataframe[new_col_name] = 'NA'
    for idx, val in enumerate(dataframe[search_col]):
        extracted = list(dataframe.loc[dataframe[search_col] == val, val])[0]
        dataframe.loc[idx, new_col_name] = extracted
    return dataframe
and the output:
>>> search_replace(df)
   A  B  C  X result
0  1  2  3  B      2
1  5  6  7  A      5
2  8  9  1  C      1

pandas named aggregation without multilevel dataframe

I am trying to remove the multi-level columns but am unable to do so.
import pandas as pd

k = pd.DataFrame([['x', 2], ['y', 4], ['x', 6]], columns=['name', 'value'])
agg_item = {'value': [('n', 'count')]}
k = k[['name', 'value']].groupby(['name'], dropna=False).agg(agg_item).reset_index()
k
  name value
           n
0    x     2
1    y     1
k.columns
MultiIndex([( 'name',  ''),
            ('value', 'n')],
           )
How do I get an SQL-like table with only 'name' and 'n' columns?
Desired output:
  name  n
0    x  2
1    y  1
You can use a named aggregation with pd.NamedAgg to avoid creating a MultiIndex in the first place:
n_agg = pd.NamedAgg(column='value', aggfunc='count')
k = k[['name','value']].groupby(['name'],dropna=False).agg(n=n_agg).reset_index()
Output:
>>> k
  name  n
0    x  2
1    y  1
Or, as @itthrill suggested, you can use .agg(n=('value', 'count')) instead of pd.NamedAgg.
By using a list in your dictionary, you request a MultiIndex.
You should use this syntax instead:
agg_item = {'n': ('value', 'count')}

(k[['name', 'value']]
 .groupby(['name'], dropna=False)
 .agg(**agg_item)
 .reset_index()
)
NB. Don't forget to unpack the dictionary as parameters
Or without a dictionary:
(k[['name', 'value']]
 .groupby(['name'], dropna=False)
 .agg(n=('value', 'count'))
 .reset_index()
)
Output:
  name  n
0    x  2
1    y  1
You can use a list comprehension to select levels:
k.columns = [col[0] if col[1]=='' else col[1] for col in k.columns]
You can also use or instead of if-else:
k.columns = [col[1] or col[0] for col in k.columns]
Or you can droplevel before reset_index in your groupby:
k = (k[['name', 'value']]
     .groupby(['name'], dropna=False)
     .agg(agg_item)
     .droplevel(0, axis=1)  # <- drop the first column level here
     .reset_index())
Output:
  name  n
0    x  2
1    y  1

How can I remove string after last underscore in python dataframe?

I want to remove everything after the last underscore from the dataframe. If my data in the dataframe looks like:
AA_XX,
AAA_BB_XX,
AA_BB_XYX,
AA_A_B_YXX
I would like to get this result:
AA,
AAA_BB,
AA_BB,
AA_A_B
You can do this simply using Series.str.split and Series.str.join:
In [2381]: df
Out[2381]:
         col1
0       AA_XX
1   AAA_BB_XX
2   AA_BB_XYX
3  AA_A_B_YXX

In [2386]: df['col1'] = df['col1'].str.split('_').str[:-1].str.join('_')

In [2387]: df
Out[2387]:
     col1
0      AA
1  AAA_BB
2   AA_BB
3  AA_A_B
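A shorter variant of the same idea, sketched on the same df: Series.str.rsplit with n=1 splits only at the last underscore, and it leaves values without an underscore untouched:
df['col1'] = df['col1'].str.rsplit('_', n=1).str[0]  # keep everything before the last '_'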
pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX']})['col'].apply(lambda r: '_'.join(r.split('_')[:-1]))
Explanation:
df = pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX']})
Creates
          col
0       AA_XX
1   AAA_BB_XX
2   AA_BB_XYX
3  AA_A_B_YXX
Use apply in order to loop through the column you want to edit. I split each string at '_' and then joined all the parts except the last:
df['col'] = df['col'].apply(lambda r: '_'.join(r.split('_')[:-1]))
print(df)
Results:
      col
0      AA
1  AAA_BB
2   AA_BB
3  AA_A_B
If your dataset contains values like AA (values without an underscore), change the lambda like this:
df = pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX', 'AA']})
df['col'] = df['col'].apply(lambda r: '_'.join(r.split('_')[:-1]) if len(r.split('_')) > 1 else r)
print(df)
Here is another way of going about it.
import pandas as pd

data = {'s': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX']}
df = pd.DataFrame(data)

def cond1(s):
    parts = s.split('_')
    if len(parts) == 1:
        # no underscore: keep the value unchanged
        return s
    return '_'.join(parts[:-1])

df['result'] = df['s'].apply(cond1)
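A regex alternative, sketched against the same df: Series.str.replace can drop everything from the last underscore onward, and values without an underscore simply don't match the pattern:
df['result'] = df['s'].str.replace(r'_[^_]*$', '', regex=True)  # strip '_' + trailing segment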

Pandas DataFrame: How to merge left with a second DataFrame on a combination of index and columns

I'm trying to merge two dataframes.
I want to merge on two keys: one that is the index of the second DataFrame, and one that is a column in the second DataFrame. The column/index names are different in the two DataFrames.
Example:
import pandas as pd

df2 = pd.DataFrame([(i, 'ABCDEFGHJKL'[j], i*2 + j)
                    for i in range(10)
                    for j in range(10)],
                   columns=['Index', 'Sub', 'Value']).set_index('Index')

df1 = pd.DataFrame([['SOMEKEY-A', 0, 'A', 'MORE'],
                    ['SOMEKEY-B', 4, 'C', 'MORE'],
                    ['SOMEKEY-C', 7, 'A', 'MORE'],
                    ['SOMEKEY-D', 5, 'Z', 'MORE']],
                   columns=['key', 'Ext. Index', 'Ext. Sub', 'Description']
                   ).set_index('key')
df1 prints out:
           Ext. Index Ext. Sub Description
key
SOMEKEY-A           0        A        MORE
SOMEKEY-B           4        C        MORE
SOMEKEY-C           7        A        MORE
SOMEKEY-D           5        Z        MORE
The first lines of df2 are:
      Sub  Value
Index
0       A      0
0       B      1
0       C      2
0       D      3
0       E      4
I want to merge "Ext. Index" and "Ext. Sub" with DataFrame df2, where the index is "Index" and the column is "Sub"
The expected result is:
           Ext. Index Ext. Sub Description  Ext. Value
key
SOMEKEY-A           0        A        MORE           0
SOMEKEY-B           4        C        MORE          10
SOMEKEY-C           7        A        MORE          14
SOMEKEY-D           5        Z        MORE        None
Manually, the merge works like this:
def get_value(x):
    try:
        return df2[(df2.Sub == x['Ext. Sub']) &
                   (df2.index == x['Ext. Index'])]['Value'].iloc[0]
    except IndexError:
        return None

df1['Ext. Value'] = df1.apply(get_value, axis=1)
Can I do this with a pd.merge or pd.concat command, without changing df2 by turning df2.index into a column?
Try using:
df_new = (df1.merge(df2[['Sub', 'Value']],
                    how='left',
                    left_on=['Ext. Index', 'Ext. Sub'],
                    right_on=[df2.index, 'Sub'])
             .set_index(df1.index)
             .drop('Sub', axis=1))
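An alternative sketch using DataFrame.join: setting 'Sub' as an extra index level on df2 returns a new object (df2 itself is untouched), and the rename produces the 'Ext. Value' column from the expected output. Missing combinations like (5, 'Z') come back as NaN rather than None:
df_new = df1.join(
    df2.set_index('Sub', append=True)['Value'].rename('Ext. Value'),
    on=['Ext. Index', 'Ext. Sub'],  # match against df2's (Index, Sub) MultiIndex
)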

Find column names when row element meets a criteria Pandas

This is a basic question. I've got a square array with the rows and columns summed up. E.g.:
df = pd.DataFrame([[0,0,1,0], [0,0,1,0], [1,0,0,0], [0,1,0,0]],
                  index=["a","b","c","d"], columns=["a","b","c","d"])
df["sumRows"] = df.sum(axis=1)
df.loc["sumCols"] = df.sum()
This returns:
In [100]: df
Out[100]:
         a  b  c  d  sumRows
a        0  0  1  0        1
b        0  0  1  0        1
c        1  0  0  0        1
d        0  1  0  0        1
sumCols  1  1  2  0        4
I need to find the column labels in the sumCols row whose value is 0. At the moment I am doing this:
df.loc["sumCols"][df.loc["sumCols"] == 0].index
But this returns a strange Index-type object. All I want is a list of the values that match this criterion, i.e. ['d'] in this case.
There are two ways (the index object can be converted to an iterable like a list).
Do it with the columns:
columns = df.columns[df.sum() == 0]
columns = list(columns)
Or you can transpose the DataFrame and treat columns as rows:
list(df.T[df.T.sumCols == 0].index)
You can use a lambda expression to filter the series, and if you want a list instead of an index as the result, you can call .tolist() on the index object:
(df.loc['sumCols'] == 0)[lambda x: x].index.tolist()
# ['d']
Or:
df.loc['sumCols'][lambda x: x == 0].index.tolist()
# ['d']
Without explicitly creating the sumCols and if you want to check which column has sum of zero, you can do:
df.sum()[lambda x: x == 0].index.tolist()
# ['d']
Check rows:
df.sum(axis = 1)[lambda x: x == 0].index.tolist()
# []
Note: the lambda expression approach is as fast as the vectorized method for subsetting, is functional in style, and can easily be written as a one-liner if you prefer.
Here's a simple method using query after transposing:
df.T.query('sumCols == 0').index.tolist()
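Since sumCols is already stored as a row, a plain boolean mask over the columns works too; a minimal sketch with the df from the question:
df.columns[df.loc['sumCols'].eq(0)].tolist()
# ['d']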
