I have the following dataframe:
import numpy as np
import pandas as pd
index = pd.MultiIndex.from_product([[1, 2], ['a', 'b', 'c'], ['a', 'b', 'c']],
names=['one', 'two', 'three'])
df = pd.DataFrame(np.random.rand(18, 3), index=index)
0 1 2
one two three
1 a b 0.002568 0.390393 0.040717
c 0.943853 0.105594 0.738587
b b 0.049197 0.500431 0.001677
c 0.615704 0.051979 0.191894
2 a b 0.748473 0.479230 0.042476
c 0.691627 0.898222 0.252423
b b 0.270330 0.909611 0.085801
c 0.913392 0.519698 0.451158
I want to select rows where the combination of index levels two and three is (a, b) or (b, c). How can I do this?
I tried df.loc[(slice(None), ['a', 'b'], ['b', 'c']), :], but that gives me all combinations of ['a', 'b'] and ['b', 'c'], including (a, c) and (b, b), which I don't want.
I tried df.loc[pd.MultiIndex.from_tuples([(None, 'a', 'b'), (None, 'b', 'c')])], but that returns NaN in level one of the index:
df.loc[pd.MultiIndex.from_tuples([(None, 'a', 'b'), (None, 'b', 'c')])]
0 1 2
NaN a b NaN NaN NaN
b c NaN NaN NaN
So I thought I needed a slice at level one, but that gives me a TypeError:
pd.MultiIndex.from_tuples([(slice(None), 'a', 'b'), (slice(None), 'b', 'c')])
TypeError: unhashable type: 'slice'
I feel like I'm missing some simple one-liner here :).
Use df.query():
In [174]: df.query("(two=='a' and three=='b') or (two=='b' and three=='c')")
Out[174]:
0 1 2
one two three
1 a b 0.211555 0.193317 0.623895
b c 0.685047 0.369135 0.899151
2 a b 0.082099 0.555929 0.524365
b c 0.901859 0.068025 0.742212
UPDATE: we can also generate such a query dynamically:
In [185]: l = [('a','b'), ('b','c')]
In [186]: q = ' or '.join(["(two=='{}' and three=='{}')".format(x,y) for x,y in l])
In [187]: q
Out[187]: "(two=='a' and three=='b') or (two=='b' and three=='c')"
In [188]: df.query(q)
Out[188]:
0 1 2
one two three
1 a b 0.211555 0.193317 0.623895
b c 0.685047 0.369135 0.899151
2 a b 0.082099 0.555929 0.524365
b c 0.901859 0.068025 0.742212
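Another option worth noting (a sketch against the question's df; MultiIndex.isin accepts tuples): drop the unused level and test membership directly.
wanted = [('a', 'b'), ('b', 'c')]

# Compare only levels 'two' and 'three' against the wanted tuples
mask = df.index.droplevel('one').isin(wanted)
df[mask]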
Here's one approach with loc and get_level_values:
In [3231]: idx = df.index.get_level_values
In [3232]: df.loc[((idx('two') == 'a') & (idx('three') == 'b')) |
((idx('two') == 'b') & (idx('three') == 'c'))]
Out[3232]:
0 1 2
one two three
1 a b 0.442332 0.380669 0.832598
b c 0.458145 0.017310 0.068655
2 a b 0.933427 0.148962 0.569479
b c 0.727993 0.172090 0.384461
Generic way
In [3262]: conds = [('a', 'b'), ('b', 'c')]
In [3263]: mask = np.column_stack(
[(idx('two') == c[0]) & (idx('three') == c[1]) for c in conds]
).any(1)
In [3264]: df.loc[mask]
Out[3264]:
0 1 2
one two three
1 a b 0.442332 0.380669 0.832598
b c 0.458145 0.017310 0.068655
2 a b 0.933427 0.148962 0.569479
b c 0.727993 0.172090 0.384461
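If this selection comes up repeatedly, the generic version can be wrapped in a small helper; a minimal sketch (the name select_level_combos is mine):
import numpy as np

def select_level_combos(df, levels, combos):
    """Select rows whose values at the given index levels match any tuple in combos."""
    get = df.index.get_level_values
    mask = np.column_stack(
        [np.logical_and.reduce([get(lvl) == val for lvl, val in zip(levels, combo)])
         for combo in combos]
    ).any(axis=1)
    return df.loc[mask]

select_level_combos(df, ['two', 'three'], [('a', 'b'), ('b', 'c')])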
I want to group a df by a column col_2, which contains mostly integers, but some cells contain a range of integers. In my real life example, each unique integer represents a specific serial number of an assembled part. Each row in the dataframe represents a single part, which is allocated to the assembled part by col_2. Some parts can only be allocated to the assembled part with a given uncertainty (range).
The expected output would be one single group for each referenced integer (assembled part S/N). For example, the entry col_1 = c should be allocated to both groups where col_2 = 1 and col_2 = 2.
df = pd.DataFrame({'col_1': ['a', 'b', 'c', 'd', 'e', 'f'],
                   'col_2': [1, 2, range(1, 3), 3, range(2, 5), 5]})
col_1 col_2
0 a 1
1 b 2
2 c (1, 2)
3 d 3
4 e (2, 3, 4)
5 f 5
print(df.groupby(['col_2']).groups)
The code above gives an error:
TypeError: '<' not supported between instances of 'range' and 'int'
I think this does what you want:
s = df.col_2.apply(pd.Series).set_index(df.col_1).stack().astype(int)
s.reset_index().groupby(0).col_1.apply(list)
The first step gives you:
col_1
a 0 1
b 0 2
c 0 1
1 2
d 0 3
e 0 2
1 3
2 4
f 0 5
And the final result is:
1 [a, c]
2 [b, c, e]
3 [d, e]
4 [e]
5 [f]
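On pandas 0.25 or newer (an assumption; explode did not exist before that), the same expansion can be written with DataFrame.explode:
# Normalize each cell to a list, then explode into one row per serial number
exploded = df.assign(col_2=df.col_2.map(lambda x: list(x) if isinstance(x, range) else [x])) \
             .explode('col_2')
exploded.groupby('col_2').col_1.apply(list)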
Try this:
df = pd.DataFrame({'col_1': ['a', 'b', 'c', 'd', 'e', 'f'],
                   'col_2': [1, 2, range(1, 3), 3, range(2, 5), 5]})
col_1 col_2
0 a 1
1 b 2
2 c (1, 2)
3 d 3
4 e (2, 3, 4)
5 f 5
# Represent every value as a range so the group keys share one hashable type;
# sort=False because ranges don't define an ordering.
df['col_2'] = df.col_2.map(lambda x: x if isinstance(x, range) else range(x, x + 1))
print(df.groupby(['col_2'], sort=False).groups)
I have a dataset where two columns have almost perfect correlation, meaning that when one column has a certain value, there is a very high chance that the second column will have another certain value. Example:
df = pd.DataFrame({'A': [1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 5],
'B': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'c', 'd', 'd', 'e', 'e', 'f', 'f', 'g']})
print(df)
Out[6]:
A B
0 1 a
1 1 a
2 1 a
3 1 a
4 1 a
5 1 a
6 2 b
7 2 c
8 3 d
9 3 d
10 4 e
11 4 e
12 5 f
13 5 f
14 5 g
When column A has a value of 1, B will have a: that's a perfect correlation, since no row with an A value of 1 has a B value other than a. That is also the case for 3 -> d and 4 -> e.
5 and 2 are not perfectly correlated.
How can I find all the A values that have more than one matching B value, so I can print them all out?
In this case, my desired output would be something like:
find_imperfect_correlations(df, 'A', 'B')
Out[7]:
2 -> 'b', 'c'
5 -> 'f', 'g'
EDIT:
Preferably a generalized answer for when the dtype of B could be ints, dates, etc.
def find_imperfect_correlations(df, col1, col2):
    df_out = (df.groupby(col1)
                .filter(lambda x: x[col2].nunique() > 1)
                .groupby(col1)[col2]
                .apply(lambda x: x.unique()))
    for key, values in df_out.items():  # .items(); .iteritems() was removed in pandas 2.0
        print(str(key) + ' -> ' + str(values))
find_imperfect_correlations(df, 'A', 'B')
Output:
2 -> ['b' 'c']
5 -> ['f' 'g']
IIUC, you can:
In [562]: s = df.groupby('A')['B'].unique()
In [563]: s[s.str.len() > 1]
Out[563]:
A
2 [b, c]
5 [f, g]
dtype: object
Or
In [564]: s[s.str.len() > 1].str.join(', ')
Out[564]:
A
2 b, c
5 f, g
dtype: object
This one would also work:
grouped = df.groupby('A').B.nunique()
df_grouped = grouped.to_frame().reset_index()
decorrelated = df_grouped[df_grouped['B'] > 1]
print(decorrelated['A'])
The first line counts the distinct values in column B for each value in column A. The second line converts the resulting series to a dataframe. The third line selects the rows where the number of distinct values is greater than 1. Then the last line prints the A values.
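If the B values should be printed as well, as in the desired output, the same count can drive a second pass; a small sketch that stays dtype-agnostic per the EDIT:
counts = df.groupby('A')['B'].nunique()
for a in counts[counts > 1].index:
    # str() keeps the printing working when B holds ints, dates, etc.
    b_values = df.loc[df['A'] == a, 'B'].unique()
    print(str(a) + ' -> ' + ', '.join(map(str, b_values)))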
How do I print the column headers where a row's values are greater than the mean (or median) of that column?
For example:
df =
a b c d
0 12 11 13 45
1 6 13 12 23
2 5 12 6 35
the output should be 0: a, c, d. 1: b, c. 2: d.
In [22]: df.gt(df.mean()).T.agg(lambda x: df.columns[x].tolist())
Out[22]:
0 [a, c, d]
1 [b, c]
2 [d]
dtype: object
or:
In [23]: df.gt(df.mean()).T.agg(lambda x: ', '.join(df.columns[x]))
Out[23]:
0 a, c, d
1 b, c
2 d
dtype: object
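The question also asks about the median; swapping df.mean() for df.median() is the only change needed:
df.gt(df.median()).T.agg(lambda x: df.columns[x].tolist())
# 0    [a, c, d]
# 1          [b]
# 2           []
# dtype: object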
You can try this using pandas; I'll break it down into steps:
df = df.reset_index().melt('index')                              # long format: one row per cell
df['MEAN'] = df.groupby('variable')['value'].transform('mean')   # per-column mean
df[df.value > df.MEAN].groupby('index').variable.apply(list)     # above-mean columns per row
Out[1016]:
index
0 [a, c, d]
1 [b, c]
2 [d]
Name: variable, dtype: object
Use df.apply to generate a mask, which you can then iterate over and index into df.columns:
mask = df.apply(lambda x: x > x.mean())
out = [(i, ', '.join(df.columns[x])) for i, x in mask.iterrows()]
print(out)
[(0, 'a, c, d'), (1, 'b, c'), (2, 'd')]
from collections import defaultdict

d = defaultdict(list)
v = df.values
# Record the column label for every cell that exceeds its column's mean
[d[df.index[r]].append(df.columns[c])
 for r, c in zip(*np.where(v > v.mean(0)))];
dict(d)
{0: ['a', 'c', 'd'], 1: ['b', 'c'], 2: ['d']}
I have a Dataframe that looks like this:
OwnerID Value
1 A
1 B
1 C
1 D
This is the shortened version; I have thousands of values for OwnerID. I'd like to create pairs from the Value column, where each Value is paired with every other Value, and have the result as a list of pairs.
For example, for the OwnerID 1, the resultset should be the following lists:
[A,B]
[A,C]
[A,D]
[B,C]
[B,D]
[C,D]
I could write 2 for loops to achieve this, but that wouldn't be very efficient or pythonic. Would someone know a better way to achieve this?
Any help would be much appreciated.
Pandas solution (using .merge() and .query() methods):
Data:
In [10]: df
Out[10]:
OwnerID Value
0 1 A
1 1 B
2 1 C
3 1 D
4 2 X
5 2 Y
6 2 Z
Solution:
In [9]: pd.merge(df, df, on='OwnerID', suffixes=['','2']).query("Value != Value2")
Out[9]:
OwnerID Value Value2
1 1 A B
2 1 A C
3 1 A D
4 1 B A
6 1 B C
7 1 B D
8 1 C A
9 1 C B
11 1 C D
12 1 D A
13 1 D B
14 1 D C
17 2 X Y
18 2 X Z
19 2 Y X
21 2 Y Z
22 2 Z X
23 2 Z Y
If you need only lists:
In [17]: pd.merge(df, df, on='OwnerID', suffixes=['','2']) \
.query("Value != Value2") \
.filter(like='Value').values
Out[17]:
array([['A', 'B'],
['A', 'C'],
['A', 'D'],
['B', 'A'],
['B', 'C'],
['B', 'D'],
['C', 'A'],
['C', 'B'],
['C', 'D'],
['D', 'A'],
['D', 'B'],
['D', 'C'],
['X', 'Y'],
['X', 'Z'],
['Y', 'X'],
['Y', 'Z'],
['Z', 'X'],
['Z', 'Y']], dtype=object)
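Note that the merge yields ordered pairs, so both (A, B) and (B, A) appear. If each unordered pair should appear only once, as in the question's example, the filter can be tightened (assuming the values support < comparison, as strings do):
# Keep each unordered pair exactly once
pd.merge(df, df, on='OwnerID', suffixes=['', '2']).query("Value < Value2")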
import itertools  # not aliased as iter, which would shadow the built-in

df2 = df.groupby('OwnerID').Value.apply(lambda x: list(itertools.combinations(x, 2)))

This returns the desired pairs for each unique OwnerID:
OwnerID
1 [(A, B), (A, C), (A, D), (B, C), (B, D), (C, D)]
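Since the result is indexed by OwnerID, per-owner pairs and a flat list both follow directly:
df2.loc[1]
# [('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D'), ('C', 'D')]

# Flatten across all owners into one list of pairs
all_pairs = [pair for pairs in df2 for pair in pairs]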
itertools is all you need.
Depending on how you want to combine them, try either permutations or combinations, for example.
Try itertools:
import itertools
list(itertools.combinations(['a','b','c','d'], 2))
#result: [('a', 'b'), ('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd'), ('c', 'd')]
val = df['Value'].values
length = len(val)
# i < j keeps each unordered pair exactly once, matching the desired output
pairs = [[val[i], val[j]] for i in range(length) for j in range(length) if i < j]
I want to slice a column in a dataframe (which contains only strings) based on the integers from a series. Here is an example:
data = pandas.DataFrame(['abc','scb','dvb'])
indices = pandas.Series([0,1,0])
Then apply some function so I get the following:
0
0 a
1 c
2 d
You can use python to manipulate the lists beforehand.
l1 = ['abc','scb','dvb']
l2 = [0,1,0]
l3 = [l1[i][l2[i]] for i in range(len(l1))]
You get l3 as
['a', 'c', 'd']
Now converting it to DataFrame
data = pd.DataFrame(l3)
You get the desired dataframe.
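The same idea can be driven directly by the pandas objects with zip, which avoids the intermediate lists (a small sketch using the question's data):
import pandas as pd

data = pd.DataFrame(['abc', 'scb', 'dvb'])
indices = pd.Series([0, 1, 0])

# Take the character at each row's index from each string in column 0
result = pd.DataFrame([s[i] for s, i in zip(data[0], indices)])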
You can use the following vectorized approach:
In [191]: [tuple(x) for x in indices.reset_index().values]
Out[191]: [(0, 0), (1, 1), (2, 0)]
In [192]: data[0].str.extractall(r'(.)') \
.loc[[tuple(x) for x in indices.reset_index().values]]
Out[192]:
0
match
0 0 a
1 1 c
2 0 d
In [193]: data[0].str.extractall(r'(.)') \
.loc[[tuple(x) for x in indices.reset_index().values]] \
.reset_index(level=1, drop=True)
Out[193]:
0
0 a
1 c
2 d
Explanation:
In [194]: data[0].str.extractall(r'(.)')
Out[194]:
0
match
0 0 a
1 b
2 c
1 0 s
1 c
2 b
2 0 d
1 v
2 b
In [195]: data[0].str.extractall(r'(.)').loc[ [ (0,0), (1,1) ] ]
Out[195]:
0
match
0 0 a
1 1 c
Numpy solution (note that this builds a rectangular character array, so it assumes all the strings have the same length):
In [259]: a = np.array([list(x) for x in data.values.reshape(1, len(data))[0]])
In [260]: a
Out[260]:
array([['a', 'b', 'c'],
['s', 'c', 'b'],
['d', 'v', 'b']],
dtype='<U1')
In [263]: pd.Series(a[np.arange(len(data)), indices])
Out[263]:
0 a
1 c
2 d
dtype: object