Find columns where values are greater than column-wise mean - python

How do I print the column headers for which a row's values are greater than the mean (or median) of the column?
For example:
df =
   a   b   c   d
0  12  11  13  45
1   6  13  12  23
2   5  12   6  35
the output should be 0: a, c, d. 1: b, c. 2: d.
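For reference, the example frame can be reconstructed as follows (my own setup code, not from the question):

```python
import pandas as pd

# reconstruction of the question's example frame
df = pd.DataFrame({'a': [12, 6, 5],
                   'b': [11, 13, 12],
                   'c': [13, 12, 6],
                   'd': [45, 23, 35]})
print(df.mean())  # column means used by the answers below
```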

In [22]: df.gt(df.mean()).T.agg(lambda x: df.columns[x].tolist())
Out[22]:
0 [a, c, d]
1 [b, c]
2 [d]
dtype: object
or:
In [23]: df.gt(df.mean()).T.agg(lambda x: ', '.join(df.columns[x]))
Out[23]:
0 a, c, d
1 b, c
2 d
dtype: object
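Breaking the one-liner down (a sketch using the example data; the intermediate names are mine):

```python
import pandas as pd

df = pd.DataFrame({'a': [12, 6, 5], 'b': [11, 13, 12],
                   'c': [13, 12, 6], 'd': [45, 23, 35]})

mask = df.gt(df.mean())  # boolean frame: each value vs. its column mean
# agg works column-wise by default, so transposing makes each original row
# a column, and the lambda receives one boolean Series per row
result = mask.T.agg(lambda x: df.columns[x].tolist())
```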

You can try this with pandas; I'll break down the steps:
df = df.reset_index().melt('index')                             # long format: index, variable, value
df['MEAN'] = df.groupby('variable')['value'].transform('mean')  # per-column mean
df[df.value > df.MEAN].groupby('index').variable.apply(list)    # collect matching columns per row
Out[1016]:
index
0 [a, c, d]
1 [b, c]
2 [d]
Name: variable, dtype: object

Use df.apply to generate a mask, which you can then iterate over and index into df.columns:
mask = df.apply(lambda x: x > x.mean())
out = [(i, ', '.join(df.columns[x])) for i, x in mask.iterrows()]
print(out)
[(0, 'a, c, d'), (1, 'b, c'), (2, 'd')]

from collections import defaultdict
import numpy as np

d = defaultdict(list)
v = df.values
for r, c in zip(*np.where(v > v.mean(0))):
    d[df.index[r]].append(df.columns[c])
dict(d)
{0: ['a', 'c', 'd'], 1: ['b', 'c'], 2: ['d']}

Related

change value in pandas dataframe using iteration

I have a training data set of the following format:
print(data.head(5))
# Output
           0  1
0  a b c d e  1
1  a b c d e  1
2  a b c d e  1
3  a b c d e  1
4  a b c d e  1
It is a text classification task and I am trying to split the text "a b c d e" into a Python list. I tried iteration:
data #the dataset
len_data = len(data)
for row_num in range(len_data):
    data.loc[row_num, 0] = data.loc[row_num, 0].split(" ")
However, this doesn't work and raises the error Must have equal len keys and value when setting with an iterable. Could someone help me with this problem? Many thanks!
Use str.split:
df[0] = df[0].str.split()
print(df)
# Output
                 0  1
0  [a, b, c, d, e]  1
1  [a, b, c, d, e]  1
2  [a, b, c, d, e]  1
3  [a, b, c, d, e]  1
4  [a, b, c, d, e]  1
Setup:
data = {0: {i: 'a b c d e' for i in range(5)}, 1: {i: 1 for i in range(5)}}
df = pd.DataFrame(data)
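If you'd rather keep the loop from the question, one fix (a sketch, assuming a default RangeIndex) is to assign with .at, which targets a single cell and accepts a list value where .loc tries to broadcast it:

```python
import pandas as pd

df = pd.DataFrame({0: ['a b c d e'] * 5, 1: [1] * 5})

for row_num in range(len(df)):
    # .at sets exactly one cell, so the list is stored as-is instead of
    # being broadcast, which is what raised the original error
    df.at[row_num, 0] = df.at[row_num, 0].split(" ")
```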

Count how often a column's value appears in another column's list of values

Column A contains strings. Column B contains lists of strings.
I would like to know how many times the value in A appears in the list in B.
I have:
A  B
k  [m]
c  [k, l, m]
j  [k, l]
e  [e, m]
e  [e, m, c, e]
I would like this output:
C
0
0
0
1
2
You can use apply on each row and then count occurrences in the list with list.count, like below.
Try this:
>>> df = pd.DataFrame({'A': ['k', 'c', 'j', 'e', 'e'],'B': [['m'],['k','l','m'],['k','l'],['e','m'],['e','m','c','e']]})
>>> df['C'] = df.apply(lambda row: row['B'].count(row['A']), axis=1)
>>> df
A B C
0 k [m] 0
1 c [k, l, m] 0
2 j [k, l] 0
3 e [e, m] 1
4 e [e, m, c, e] 2
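An equivalent without apply (my own variation): zip the two columns and use list.count directly:

```python
import pandas as pd

df = pd.DataFrame({'A': ['k', 'c', 'j', 'e', 'e'],
                   'B': [['m'], ['k', 'l', 'm'], ['k', 'l'],
                         ['e', 'm'], ['e', 'm', 'c', 'e']]})

# plain Python iteration over the paired columns, no per-row apply overhead
df['C'] = [b.count(a) for a, b in zip(df['A'], df['B'])]
```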

Counting each unique array of an array in each row of a column in a data frame

I am practicing pandas and Python and I am not so good at for loops. I have a data frame as below; let's say this is df:
Name  Value
A     [[A,B],[C,D]]
B     [[A,B],[D,E]]
C     [[D,E],[K,L],[M,L]]
D     [[K,L]]
I want to go through each row, find the unique arrays, and count them.
I have tried np.unique(a, return_index=True), but it returns two separate lists and my problem is I don't know how to iterate over each array.
Expected result would be:
Value  Counts
[A,B]  2
[D,E]  2
[K,L]  2
[C,D]  1
[M,L]  1
Thank you very much.
Use DataFrame.explode in pandas 0.25+:
df.explode('Value')['Value'].value_counts()
Output:
[K, L] 2
[A, B] 2
[D, E] 2
[C, D] 1
[M, L] 1
Name: Value, dtype: int64
Use Series.explode with Series.value_counts:
df = df['Value'].explode().value_counts().rename_axis('Value').reset_index(name='Counts')
print (df)
Value Counts
0 [D, E] 2
1 [A, B] 2
2 [K, L] 2
3 [C, D] 1
4 [M, L] 1
Numpy solution:
a, v = np.unique(np.concatenate(df['Value']),axis=0, return_counts=True)
df = pd.DataFrame({'Value':a.tolist(), 'Counts':v})
print (df)
Value Counts
0 [A, B] 2
1 [C, D] 1
2 [D, E] 2
3 [K, L] 2
4 [M, L] 1
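One caveat (my addition, not from the answers above): on recent pandas versions, value_counts over a column of raw lists can raise TypeError: unhashable type: 'list'. Converting each element to a tuple first sidesteps this:

```python
import pandas as pd

df = pd.DataFrame({'Name': list('ABCD'),
                   'Value': [[['A', 'B'], ['C', 'D']],
                             [['A', 'B'], ['D', 'E']],
                             [['D', 'E'], ['K', 'L'], ['M', 'L']],
                             [['K', 'L']]]})

# tuples are hashable, so value_counts can build its hash table
counts = df['Value'].explode().apply(tuple).value_counts()
```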

Operations on multiple Dataframes in Python

Data frames are provided:
a = pd.DataFrame({'A':[1, 2]})
b = pd.DataFrame({'B':[2, 3]})
C = pd.DataFrame({'C':[4, 5]})
and list d = [A, C, B, B].
How do I write the mathematical operation (((A + C) * B) - B) on the frame values to create a new data frame?
The result is, for example, a frame in the form:
e = pd.DataFrame({'E':[8, 18]})
IIUC:
In [132]: formula = "E = (((A + C) * B) - B)"
In [133]: pd.concat([a,b,C], axis=1).eval(formula, inplace=False)
Out[133]:
A B C E
0 1 2 4 8
1 2 3 5 18
In [134]: pd.concat([a,b,C], axis=1).eval(formula, inplace=False)[['E']]
Out[134]:
E
0 8
1 18
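Without eval, the same result can be computed directly from the underlying Series (a sketch; operating on the Series avoids column-label alignment between the differently named frames):

```python
import pandas as pd

a = pd.DataFrame({'A': [1, 2]})
b = pd.DataFrame({'B': [2, 3]})
C = pd.DataFrame({'C': [4, 5]})

# ((A + C) * B) - B, evaluated on the Series so only the row index aligns
e = pd.DataFrame({'E': (a['A'] + C['C']) * b['B'] - b['B']})
```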

How can I select combinations of levels in a pandas Multindex?

I have the following dataframe:
import numpy as np
import pandas as pd
index = pd.MultiIndex.from_product([[1, 2], ['a', 'b', 'c'], ['a', 'b', 'c']],
names=['one', 'two', 'three'])
df = pd.DataFrame(np.random.rand(18, 3), index=index)
                      0         1         2
one two three
1   a   b      0.002568  0.390393  0.040717
        c      0.943853  0.105594  0.738587
    b   b      0.049197  0.500431  0.001677
        c      0.615704  0.051979  0.191894
2   a   b      0.748473  0.479230  0.042476
        c      0.691627  0.898222  0.252423
    b   b      0.270330  0.909611  0.085801
        c      0.913392  0.519698  0.451158
I want to select rows where combination of index levels two and three are (a, b) or (b, c). How can I do this?
I tried df.loc[(slice(None), ['a', 'b'], ['b', 'c']), :] but that gives me all combinations of [a, b] and [b, c], including (a, c) and (b, b), which aren't needed.
I tried df.loc[pd.MultiIndex.from_tuples([(None, 'a', 'b'), (None, 'b', 'c')])] but that returns NaN in level one of the index.
df.loc[pd.MultiIndex.from_tuples([(None, 'a', 'b'), (None, 'b', 'c')])]
            0   1   2
NaN a b   NaN NaN NaN
    b c   NaN NaN NaN
So I thought I needed a slice at level one, but that gives me a TypeError:
pd.MultiIndex.from_tuples([(slice(None), 'a', 'b'), (slice(None), 'b', 'c')])
TypeError: unhashable type: 'slice'
I feel like I'm missing some simple one-liner here :).
Use df.query():
In [174]: df.query("(two=='a' and three=='b') or (two=='b' and three=='c')")
Out[174]:
                      0         1         2
one two three
1   a   b      0.211555  0.193317  0.623895
    b   c      0.685047  0.369135  0.899151
2   a   b      0.082099  0.555929  0.524365
    b   c      0.901859  0.068025  0.742212
UPDATE: we can also generate such a "query" dynamically:
In [185]: l = [('a','b'), ('b','c')]
In [186]: q = ' or '.join(["(two=='{}' and three=='{}')".format(x,y) for x,y in l])
In [187]: q
Out[187]: "(two=='a' and three=='b') or (two=='b' and three=='c')"
In [188]: df.query(q)
Out[188]:
                      0         1         2
one two three
1   a   b      0.211555  0.193317  0.623895
    b   c      0.685047  0.369135  0.899151
2   a   b      0.082099  0.555929  0.524365
    b   c      0.901859  0.068025  0.742212
Here's one approach with loc and get_level_values:
In [3231]: idx = df.index.get_level_values
In [3232]: df.loc[((idx('two') == 'a') & (idx('three') == 'b')) |
                  ((idx('two') == 'b') & (idx('three') == 'c'))]
Out[3232]:
                      0         1         2
one two three
1   a   b      0.442332  0.380669  0.832598
    b   c      0.458145  0.017310  0.068655
2   a   b      0.933427  0.148962  0.569479
    b   c      0.727993  0.172090  0.384461
Generic way
In [3262]: conds = [('a', 'b'), ('b', 'c')]
In [3263]: mask = np.column_stack(
               [(idx('two') == c[0]) & (idx('three') == c[1]) for c in conds]
           ).any(1)
In [3264]: df.loc[mask]
Out[3264]:
                      0         1         2
one two three
1   a   b      0.442332  0.380669  0.832598
    b   c      0.458145  0.017310  0.068655
2   a   b      0.933427  0.148962  0.569479
    b   c      0.727993  0.172090  0.384461
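Another option (my addition, not from the answers above): drop the unconstrained level and test the remaining (two, three) pairs for membership with MultiIndex.isin, which accepts tuples:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[1, 2], ['a', 'b', 'c'], ['a', 'b', 'c']],
                                   names=['one', 'two', 'three'])
df = pd.DataFrame(np.random.rand(18, 3), index=index)

wanted = [('a', 'b'), ('b', 'c')]
# droplevel leaves a (two, three) MultiIndex; isin checks exact pairs
out = df.loc[df.index.droplevel('one').isin(wanted)]
```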
