I have a DataFrame that looks like this:
OwnerID Value
1 A
1 B
1 C
1 D
This is a shortened version; I have thousands of values per OwnerID. I'd like to create pairs from the Value column, where each Value is paired with every other Value, and get the result as a list of pairs.
For example, for the OwnerID 1, the resultset should be the following lists:
[A,B]
[A,C]
[A,D]
[B,C]
[B,D]
[C,D]
I could write two for loops to achieve this, but that wouldn't be very efficient or Pythonic. Does someone know a better way to achieve this?
Any help would be much appreciated.
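For reference, the plain two-loop version alluded to above might look like this (a minimal sketch; it assumes df holds the rows for a single OwnerID):
values = df['Value'].tolist()
pairs = []
for i in range(len(values)):
    for j in range(i + 1, len(values)):   # start at i + 1 so each pair appears only once
        pairs.append([values[i], values[j]])
# pairs -> [['A', 'B'], ['A', 'C'], ['A', 'D'], ['B', 'C'], ['B', 'D'], ['C', 'D']]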
Pandas solution (using .merge() and .query() methods):
Data:
In [10]: df
Out[10]:
OwnerID Value
0 1 A
1 1 B
2 1 C
3 1 D
4 2 X
5 2 Y
6 2 Z
Solution:
In [9]: pd.merge(df, df, on='OwnerID', suffixes=['','2']).query("Value != Value2")
Out[9]:
OwnerID Value Value2
1 1 A B
2 1 A C
3 1 A D
4 1 B A
6 1 B C
7 1 B D
8 1 C A
9 1 C B
11 1 C D
12 1 D A
13 1 D B
14 1 D C
17 2 X Y
18 2 X Z
19 2 Y X
21 2 Y Z
22 2 Z X
23 2 Z Y
If you need only lists:
In [17]: pd.merge(df, df, on='OwnerID', suffixes=['','2']) \
.query("Value != Value2") \
.filter(like='Value').values
Out[17]:
array([['A', 'B'],
['A', 'C'],
['A', 'D'],
['B', 'A'],
['B', 'C'],
['B', 'D'],
['C', 'A'],
['C', 'B'],
['C', 'D'],
['D', 'A'],
['D', 'B'],
['D', 'C'],
['X', 'Y'],
['X', 'Z'],
['Y', 'X'],
['Y', 'Z'],
['Z', 'X'],
['Z', 'Y']], dtype=object)
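If you only want each unordered pair once ([A, B] but not [B, A]), one option is to keep only the rows where the first value sorts before the second (a sketch, assuming the values are comparable strings):
pairs = pd.merge(df, df, on='OwnerID', suffixes=['', '2'])
pairs = pairs[pairs['Value'] < pairs['Value2']]   # drops self-pairs and reversed duplicates
pairs.filter(like='Value').values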
import itertools
df2 = df.groupby('OwnerID').Value.apply(lambda x: list(itertools.combinations(x, 2)))
This returns the desired pairs for each unique OwnerID:
OwnerID
1 [(A, B), (A, C), (A, D), (B, C), (B, D), (C, D)]
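If you then need a single flat list of pairs rather than one list per owner, you can flatten the grouped Series (a small sketch):
all_pairs = [pair for owner_pairs in df2 for pair in owner_pairs]
# e.g. [('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D'), ('C', 'D')]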
itertools is all you need.
Depending on how you want to combine them, try either permutations or combinations, for example.
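To illustrate the difference between the two (a quick sketch):
import itertools
vals = ['A', 'B', 'C']
list(itertools.combinations(vals, 2))   # [('A', 'B'), ('A', 'C'), ('B', 'C')] - each pair once
list(itertools.permutations(vals, 2))   # [('A', 'B'), ('A', 'C'), ('B', 'A'), ('B', 'C'), ('C', 'A'), ('C', 'B')] - both orderings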
Try itertools:
import itertools
list(itertools.combinations(['a','b','c','d'], 2))
#result: [('a', 'b'), ('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd'), ('c', 'd')]
val = df['Value'].values
length = len(val)
# This produces both orderings; use range(i + 1, length) for j if each pair should appear only once.
pairs = [[val[i], val[j]] for i in range(length) for j in range(length) if i != j]
I have a dataframe like this:
A B C D
b 3 3 4
a 1 2 1
a 1 2 1
d 4 4 1
d 1 2 1
c 4 5 6
Now I want to reorder the rows based on the values in column A.
I don't want to sort the values; I want to reorder them in a specific order, like ['b', 'd', 'c', 'a'].
what I expect is:
A B C D
b 3 3 4
d 4 4 1
d 1 2 1
c 4 5 6
a 1 2 1
a 1 2 1
This is a good use case for pd.Categorical, since you have ordered categories. Just make that column a categorical and mark ordered=True. Then, sort_values should do the rest.
df['A'] = pd.Categorical(df.A, categories=['b', 'd', 'c', 'a'], ordered=True)
df.sort_values('A')
If you want to keep your column as is, you can just use loc and the indexes.
df.loc[pd.Series(pd.Categorical(df.A,
                                categories=['b', 'd', 'c', 'a'],
                                ordered=True))
         .sort_values()
         .index]
Use a dictionary-like mapping for the order of the strings, then sort the values and reindex:
order = ['b', 'd', 'c', 'a']
df = df.reindex(df['A'].map(dict(zip(order, range(len(order))))).sort_values().index)
print(df)
A B C D
0 b 3 3 4
3 d 4 4 1
4 d 1 2 1
5 c 4 5 6
1 a 1 2 1
2 a 1 2 1
Without changing the datatype of A, you can set 'A' as the index and select the rows in the desired order defined by sk.
sk = ['b', 'd', 'c', 'a']
df.set_index('A').loc[sk].reset_index()
Or use a temp column for sorting:
sk = ['b', 'd', 'c', 'a']
(
    df.assign(S=df.A.map({v: k for k, v in enumerate(sk)}))
      .sort_values(by='S')
      .drop('S', axis=1)
)
I'm taking the solution provided by rafaelc a step further. If you want to do it in a chained process, here is how you'd do it:
df = (
    df.assign(A=lambda x: pd.Categorical(x['A'], categories=['b', 'd', 'c', 'a'], ordered=True))
      .sort_values('A')
)
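One follow-up note: after this, column A is a Categorical. If you need it back as plain strings, converting is straightforward (a small sketch):
df['A'] = df['A'].astype(str)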
I have a list of group IDs:
letters = ['A', 'A/D', 'B', 'B/D', 'C', 'C/D', 'D']
and a dataframe of groups:
groups = pd.DataFrame({'group': ['B', 'A/D', 'D', 'D', 'A']})
I'd like to create a column in the dataframe that gives the position of the group ids in the list, like so:
group group_idx
0 B 2
1 A/D 1
2 D 6
3 D 6
4 A 0
My current solution is this:
group_to_num = {hsg: i for i, hsg in enumerate(letters)}
groups['group_idx'] = groups.applymap(lambda x: group_to_num.get(x)).max(axis=1).fillna(-1).astype(np.int32)
but it seems inelegant. Is there a simpler way of doing this?
You can try merge after a DataFrame constructor:
groups.merge(pd.DataFrame(letters).reset_index(), left_on='group', right_on=0) \
      .rename(columns={'index': 'group_idx'}).drop(columns=0)
group group_idx
0 B 2
1 A/D 1
2 D 6
3 D 6
4 A 0
Use map:
import pandas as pd
letters = ['A', 'A/D', 'B', 'B/D', 'C', 'C/D', 'D']
group_to_num = {hsg: i for i, hsg in enumerate(letters)}
groups = pd.DataFrame({'group': ['B', 'A/D', 'D', 'D', 'A']})
groups['group_idx'] = groups.group.map(group_to_num)
print(groups)
Output
group group_idx
0 B 2
1 A/D 1
2 D 6
3 D 6
4 A 0
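If groups could contain IDs that are not in letters, map will produce NaN for them; you can fall back to a sentinel the way the original attempt did (a sketch reusing the same group_to_num dict):
groups['group_idx'] = groups.group.map(group_to_num).fillna(-1).astype(int)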
I have a dataset where two columns are almost perfectly correlated, meaning that when one column has a certain value, there is a very high chance that the other column will have another certain value. Example:
df = pd.DataFrame({'A': [1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 5],
'B': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'c', 'd', 'd', 'e', 'e', 'f', 'f', 'g']})
print(df)
Out[6]:
A B
0 1 a
1 1 a
2 1 a
3 1 a
4 1 a
5 1 a
6 2 b
7 2 c
8 3 d
9 3 d
10 4 e
11 4 e
12 5 f
13 5 f
14 5 g
When column A has a value of 1, B will have a - that's a perfect correlation, since there is no row where A is 1 and B is something other than a. The same holds for 3 -> d and 4 -> e.
The values 2 and 5 are not perfectly correlated.
How can I find all the A values that have more than one matching B value, so I can print them all out?
In this case, my desired output would be something like
find_imperfect_correlations(df, 'A', 'B')
Out[7]:
2 -> 'b', 'c'
5 -> 'f', 'g'
EDIT:
Preferably a generalized answer for when the dtype of B could be ints, dates, etc.
def find_imperfect_correlations(df, col1, col2):
    df_out = df.groupby(col1).filter(lambda x: x[col2].nunique() > 1).groupby(col1)[col2].apply(lambda x: x.unique())
    for key, vals in df_out.items():
        print(str(key) + ' -> ' + str(vals))
find_imperfect_correlations(df, 'A', 'B')
Output:
2 -> ['b' 'c']
5 -> ['f' 'g']
IIUIC, you can
In [562]: s = df.groupby('A')['B'].unique()
In [563]: s[s.str.len() > 1]
Out[563]:
A
2 [b, c]
5 [f, g]
dtype: object
Or
In [564]: s[s.str.len() > 1].str.join(', ')
Out[564]:
A
2 b, c
5 f, g
dtype: object
This one would also work:
grouped = df.groupby('A').B.nunique()
df_grouped = grouped.to_frame().reset_index()
decorrelated = df_grouped[df_grouped['B'] > 1]
print(decorrelated['A'])
The first line counts the distinct values in column B for each value in column A. The second line converts the resulting series to a dataframe. The third line selects the rows where the number of distinct values is greater than 1. Then the last line prints the A values.
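If you also want to print the matching B values for those A values, as in the desired output, one way is to filter the original frame with the A values found above (a sketch building on the same variables):
imperfect = df[df['A'].isin(decorrelated['A'])].groupby('A')['B'].unique()
for a, bs in imperfect.items():
    print(str(a) + ' -> ' + ', '.join(map(str, bs)))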
I have the following dataframe:
import numpy as np
import pandas as pd
index = pd.MultiIndex.from_product([[1, 2], ['a', 'b', 'c'], ['a', 'b', 'c']],
                                   names=['one', 'two', 'three'])
df = pd.DataFrame(np.random.rand(18, 3), index=index)
0 1 2
one two three
1 a b 0.002568 0.390393 0.040717
c 0.943853 0.105594 0.738587
b b 0.049197 0.500431 0.001677
c 0.615704 0.051979 0.191894
2 a b 0.748473 0.479230 0.042476
c 0.691627 0.898222 0.252423
b b 0.270330 0.909611 0.085801
c 0.913392 0.519698 0.451158
I want to select rows where combination of index levels two and three are (a, b) or (b, c). How can I do this?
I tried df.loc[(slice(None), ['a', 'b'], ['b', 'c']), :] but that gives me all combinations of [a, b] and [b, c], including (a, c) and (b, b), which aren't needed.
I tried df.loc[pd.MultiIndex.from_tuples([(None, 'a', 'b'), (None, 'b', 'c')])] but that returns NaN in level one of the index.
df.loc[pd.MultiIndex.from_tuples([(None, 'a', 'b'), (None, 'b', 'c')])]
0 1 2
NaN a b NaN NaN NaN
b c NaN NaN NaN
So I thought I needed a slice at level one, but that gives me a TypeError:
pd.MultiIndex.from_tuples([(slice(None), 'a', 'b'), (slice(None), 'b', 'c')])
TypeError: unhashable type: 'slice'
I feel like I'm missing some simple one-liner here :).
Use df.query():
In [174]: df.query("(two=='a' and three=='b') or (two=='b' and three=='c')")
Out[174]:
0 1 2
one two three
1 a b 0.211555 0.193317 0.623895
b c 0.685047 0.369135 0.899151
2 a b 0.082099 0.555929 0.524365
b c 0.901859 0.068025 0.742212
UPDATE: we can also generate such a query dynamically:
In [185]: l = [('a','b'), ('b','c')]
In [186]: q = ' or '.join(["(two=='{}' and three=='{}')".format(x,y) for x,y in l])
In [187]: q
Out[187]: "(two=='a' and three=='b') or (two=='b' and three=='c')"
In [188]: df.query(q)
Out[188]:
0 1 2
one two three
1 a b 0.211555 0.193317 0.623895
b c 0.685047 0.369135 0.899151
2 a b 0.082099 0.555929 0.524365
b c 0.901859 0.068025 0.742212
Here's one approach with loc and get_level_values:
In [3231]: idx = df.index.get_level_values
In [3232]: df.loc[((idx('two') == 'a') & (idx('three') == 'b')) |
((idx('two') == 'b') & (idx('three') == 'c'))]
Out[3232]:
0 1 2
one two three
1 a b 0.442332 0.380669 0.832598
b c 0.458145 0.017310 0.068655
2 a b 0.933427 0.148962 0.569479
b c 0.727993 0.172090 0.384461
Generic way
In [3262]: conds = [('a', 'b'), ('b', 'c')]
In [3263]: mask = np.column_stack(
[(idx('two') == c[0]) & (idx('three') == c[1]) for c in conds]
).any(1)
In [3264]: df.loc[mask]
Out[3264]:
0 1 2
one two three
1 a b 0.442332 0.380669 0.832598
b c 0.458145 0.017310 0.068655
2 a b 0.933427 0.148962 0.569479
b c 0.727993 0.172090 0.384461
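Another compact option (not from the answers above, so treat it as a sketch) is to compare the (two, three) part of the index against the wanted tuples; Index.isin accepts an iterable of tuples for a MultiIndex:
df[df.index.droplevel('one').isin([('a', 'b'), ('b', 'c')])]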
I am dealing with a pandas Series like the following:
x=pd.Series([1, 2, 1, 4, 2, 6, 7, 8, 1, 1], index=['a', 'b', 'a', 'c', 'b', 'd', 'e', 'f', 'g', 'g'])
The index labels are non-unique, but each label always maps to the same value; for example, 'a' always corresponds to 1 in my sample, 'b' always maps to 2, etc. So if I want to see which value corresponds to each index label, I simply need to write:
x.mean(level=0)
a 1
b 2
c 4
d 6
e 7
f 8
g 1
dtype: int64
The difficulty arises when the values are strings: I can't call mean() on strings, but I would still like to return a similar result in that case. Any ideas on a good way to do that?
x=pd.Series(['1', '2', '1', '4', '2', '6', '7', '8', '1', '1'], index=['a', 'b', 'a', 'c', 'b', 'd', 'e', 'f', 'g', 'g'])
So long as your indices map directly to the values, you can simply call drop_duplicates:
In [83]:
x.drop_duplicates()
Out[83]:
a 1
b 2
c 4
d 6
e 7
f 8
dtype: int64
example:
In [86]:
x = pd.Series(['XX', 'hello', 'XX', '4', 'hello', '6', '7', '8'], index=['a', 'b', 'a', 'c', 'b', 'd', 'e', 'f'])
x
Out[86]:
a XX
b hello
a XX
c 4
b hello
d 6
e 7
f 8
dtype: object
In [87]:
x.drop_duplicates()
Out[87]:
a XX
b hello
c 4
d 6
e 7
f 8
dtype: object
EDIT: a roundabout method would be to reset the index so that the index values become a new column, drop duplicates, and then set the index back again:
In [100]:
x.reset_index().drop_duplicates().set_index('index')
Out[100]:
0
index
a 1
b 2
c 4
d 6
e 7
f 8
g 1
pandas.Series.values is a NumPy ndarray. Perhaps doing a values.astype(int) would solve your problem?
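A small sketch of that idea, assuming the string values in the question convert cleanly to integers (Series.astype does the same thing without leaving pandas):
x = pd.Series(['1', '2', '1', '4', '2', '6', '7', '8', '1', '1'],
              index=['a', 'b', 'a', 'c', 'b', 'd', 'e', 'f', 'g', 'g'])
x_int = x.astype(int)              # or pd.Series(x.values.astype(int), index=x.index)
x_int.groupby(level=0).mean()      # per-label mean works again on the numeric values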
You can also ensure that you're getting all of the unique indices without reshaping the array, by getting the positions of the unique index values and plugging them back in with iloc. NumPy's unique function has a return_index argument, which makes it return a tuple of (unique_values, first_positions):
In [3]: x.iloc[np.unique(x.index.values, return_index=True)[1]]
Out[3]:
a 1
b 2
c 4
d 6
e 7
f 8
g 1
dtype: int64