I am dealing with pandas series like the following
x=pd.Series([1, 2, 1, 4, 2, 6, 7, 8, 1, 1], index=['a', 'b', 'a', 'c', 'b', 'd', 'e', 'f', 'g', 'g'])
The indices are non-unique, but each index always maps to the same value; for example, 'a' always corresponds to 1 in my sample, 'b' always maps to 2, etc. So if I want to see which value corresponds to each index, I simply need to write
x.mean(level=0)
a 1
b 2
c 4
d 6
e 7
f 8
g 1
dtype: int64
The difficulty arises when the values are strings: I can't call mean() on strings, but I would still like to return a similar list in this case. Any ideas on a good way to do that?
x=pd.Series(['1', '2', '1', '4', '2', '6', '7', '8', '1', '1'], index=['a', 'b', 'a', 'c', 'b', 'd', 'e', 'f', 'g', 'g'])
So long as your indices map directly to the values, you can simply call drop_duplicates. Note, though, that it compares values only, so a second index that maps to an already-seen value is dropped too ('g', which also maps to 1, is missing below):
In [83]:
x.drop_duplicates()
Out[83]:
a 1
b 2
c 4
d 6
e 7
f 8
dtype: int64
example:
In [86]:
x = pd.Series(['XX', 'hello', 'XX', '4', 'hello', '6', '7', '8'], index=['a', 'b', 'a', 'c', 'b', 'd', 'e', 'f'])
x
Out[86]:
a XX
b hello
a XX
c 4
b hello
d 6
e 7
f 8
dtype: object
In [87]:
x.drop_duplicates()
Out[87]:
a XX
b hello
c 4
d 6
e 7
f 8
dtype: object
EDIT: a roundabout method that also keeps 'g' would be to reset the index so that the index values become a new column, drop duplicates, and then set the index back again:
In [100]:
x.reset_index().drop_duplicates().set_index('index')
Out[100]:
0
index
a 1
b 2
c 4
d 6
e 7
f 8
g 1
pandas.Series.values is a numpy ndarray. Perhaps doing values.astype(int) would solve your problem?
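A minimal sketch of that idea on the question's string series. This only applies when the strings are numeric, as in the sample, and recent pandas versions spell the level-aware mean as a groupby on the index level:

```python
import pandas as pd

# the string-valued series from the question
x = pd.Series(['1', '2', '1', '4', '2', '6', '7', '8', '1', '1'],
              index=['a', 'b', 'a', 'c', 'b', 'd', 'e', 'f', 'g', 'g'])

# cast the underlying ndarray to int, keeping the original index
x_num = pd.Series(x.values.astype(int), index=x.index)

# level-aware mean; equivalent to x.mean(level=0) in older pandas
result = x_num.groupby(level=0).mean()
print(result)
```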
You can also get all of the unique indices without reshaping the array: take the positions of the first occurrence of each unique index value and plug them back in with iloc. Numpy's unique function includes a return_index arg which returns a tuple of (unique_values, first_indices):
In [3]: x.iloc[np.unique(x.index.values, return_index=True)[1]]
Out[3]:
a 1
b 2
c 4
d 6
e 7
f 8
g 1
dtype: int64
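A sketch of one more alternative, assuming only that every index maps to a single value as in the question: group on the index level and take first(). This works for strings too, since first() needs no arithmetic, and it keeps 'g' (unlike plain drop_duplicates):

```python
import pandas as pd

# the string-valued series from the question
x = pd.Series(['1', '2', '1', '4', '2', '6', '7', '8', '1', '1'],
              index=['a', 'b', 'a', 'c', 'b', 'd', 'e', 'f', 'g', 'g'])

# group on the index (level=0); distinct indices that share a value
# ('a' and 'g' both map to '1') each survive as their own row
result = x.groupby(level=0).first()
print(result)
```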
Related
I have a dataframe like this:
A B C D
b 3 3 4
a 1 2 1
a 1 2 1
d 4 4 1
d 1 2 1
c 4 5 6
Now I want to reorder the rows based on the values in column A.
I don't want to sort them alphabetically but in a specific order: ['b', 'd', 'c', 'a'].
what I expect is:
A B C D
b 3 3 4
d 4 4 1
d 1 2 1
c 4 5 6
a 1 2 1
a 1 2 1
This is a good use case for pd.Categorical, since you have ordered categories. Just make that column a categorical and mark ordered=True. Then, sort_values should do the rest.
df['A'] = pd.Categorical(df.A, categories=['b', 'd', 'c', 'a'], ordered=True)
df.sort_values('A')
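A quick runnable check of that approach (a sketch; the frame from the question is rebuilt inline so the snippet is self-contained):

```python
import pandas as pd

# the sample frame from the question
df = pd.DataFrame({'A': ['b', 'a', 'a', 'd', 'd', 'c'],
                   'B': [3, 1, 1, 4, 1, 4],
                   'C': [3, 2, 2, 4, 2, 5],
                   'D': [4, 1, 1, 1, 1, 6]})

# the categories list fixes the sort order
df['A'] = pd.Categorical(df.A, categories=['b', 'd', 'c', 'a'], ordered=True)
ordered_a = df.sort_values('A')['A'].tolist()
print(ordered_a)
```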
If you want to keep your column as is, you can just use loc and the indexes.
df.loc[pd.Series(pd.Categorical(df.A,
                                categories=['b', 'd', 'c', 'a'],
                                ordered=True))
       .sort_values()
       .index]
Use a dictionary mapping for the order of the strings, then sort the values and reindex:
order = ['b', 'd', 'c', 'a']
df = df.reindex(df['A'].map(dict(zip(order, range(len(order))))).sort_values().index)
print(df)
A B C D
0 b 3 3 4
3 d 4 4 1
4 d 1 2 1
5 c 4 5 6
1 a 1 2 1
2 a 1 2 1
Without changing the datatype of A, you can set 'A' as the index and select rows in the desired order defined by sk:
sk = ['b', 'd', 'c', 'a']
df.set_index('A').loc[sk].reset_index()
Or use a temp column for sorting:
sk = ['b', 'd', 'c', 'a']
(
df.assign(S=df.A.map({v:k for k,v in enumerate(sk)}))
.sort_values(by='S')
.drop('S', axis=1)
)
I'm taking the solution provided by rafaelc a step further. If you want to do it in a chained process, here is how you'd do it:
df = (
df
.assign(A = lambda x: pd.Categorical(x['A'], categories = ['b', 'd', 'c', 'a'], ordered = True))
.sort_values('A')
)
I have a list of group IDs:
letters = ['A', 'A/D', 'B', 'B/D', 'C', 'C/D', 'D']
and a dataframe of groups:
groups = pd.DataFrame({'group': ['B', 'A/D', 'D', 'D', 'A']})
I'd like to create a column in the dataframe that gives the position of the group ids in the list, like so:
group group_idx
0 B 2
1 A/D 1
2 D 6
3 D 6
4 A 0
My current solution is this:
group_to_num = {hsg: i for i, hsg in enumerate(letters)}
groups['group_idx'] = groups.applymap(lambda x: group_to_num.get(x)).max(axis=1).fillna(-1).astype(np.int32)
but it seems inelegant. Is there a simpler way of doing this?
You can try merge after a dataframe constructor:
groups.merge(pd.DataFrame(letters).reset_index(), left_on='group', right_on=0)\
      .rename(columns={'index': 'group_idx'}).drop(columns=0)
group group_idx
0 B 2
1 A/D 1
2 D 6
3 D 6
4 A 0
Use map:
import pandas as pd
letters = ['A', 'A/D', 'B', 'B/D', 'C', 'C/D', 'D']
group_to_num = {hsg: i for i, hsg in enumerate(letters)}
groups = pd.DataFrame({'group': ['B', 'A/D', 'D', 'D', 'A']})
groups['group_idx'] = groups.group.map(group_to_num)
print(groups)
Output
group group_idx
0 B 2
1 A/D 1
2 D 6
3 D 6
4 A 0
I have a dataset where two columns have almost perfect correlation, meaning that when one column has a certain value, there is a very high chance that the second column will have a certain corresponding value. Example:
df = pd.DataFrame({'A': [1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 5],
'B': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'c', 'd', 'd', 'e', 'e', 'f', 'f', 'g']})
print(df)
Out[6]:
A B
0 1 a
1 1 a
2 1 a
3 1 a
4 1 a
5 1 a
6 2 b
7 2 c
8 3 d
9 3 d
10 4 e
11 4 e
12 5 f
13 5 f
14 5 g
When column A has the value 1, B will have 'a': that's a perfect correlation, as there is no row where A is 1 and B is anything other than 'a'. The same holds for 3 -> d and 4 -> e.
The values 2 and 5 are not perfectly correlated.
How can I find all the A values who has more than one matching B values so I could print them all out?
In this case, my desired output would be something like
find_imperfect_correlations(df, 'A', 'B')
Out[7]:
2 -> 'b', 'c'
5 -> 'f', 'g'
EDIT:
Preferably a generalized answer for when the dtype of B could be ints, dates, etc.
def find_imperfect_correlations(df, col1, col2):
    df_out = (df.groupby(col1)
                .filter(lambda x: x[col2].nunique() > 1)
                .groupby(col1)[col2]
                .apply(lambda x: x.unique()))
    for idx, vals in df_out.items():  # iteritems() was removed in pandas 2.0
        print(str(idx) + ' -> ' + str(vals))
find_imperfect_correlations(df, 'A', 'B')
Output:
2 -> ['b' 'c']
5 -> ['f' 'g']
IIUIC, you can
In [562]: s = df.groupby('A')['B'].unique()
In [563]: s[s.str.len() > 1]
Out[563]:
A
2 [b, c]
5 [f, g]
dtype: object
Or
In [564]: s[s.str.len() > 1].str.join(', ')
Out[564]:
A
2 b, c
5 f, g
dtype: object
This one would also work:
grouped = df.groupby('A').B.nunique()
df_grouped = grouped.to_frame().reset_index()
decorrelated = df_grouped[df_grouped['B'] > 1]
print(decorrelated['A'])
The first line counts the distinct values in column B for each value in column A. The second line converts the resulting series to a dataframe. The third line selects the rows where the number of distinct values is greater than 1. Then the last line prints the A values.
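The four steps above can also be chained into a single expression; a sketch using the sample frame from the question:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 5],
                   'B': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'c',
                         'd', 'd', 'e', 'e', 'f', 'f', 'g']})

# count distinct B values per A, then keep only the A values with more than one
decorrelated = (df.groupby('A')['B'].nunique()
                  .reset_index()
                  .query('B > 1')['A'])
print(decorrelated.tolist())
```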
I have a Dataframe that looks like this:
OwnerID Value
1 A
1 B
1 C
1 D
This is a shortened version; I have thousands of values for OwnerID. I'd like to create pairs from the Value column, where each Value is paired with every other Value, and get the result as a list of pairs.
For example, for the OwnerID 1, the resultset should be the following lists:
[A,B]
[A,C]
[A,D]
[B,C]
[B,D]
[C,D]
I could write 2 for loops to achieve this, but that wouldn't be very efficient or pythonic. Would someone know a better way to achieve this?
Any help would be much appreciated.
Pandas solution (using .merge() and .query() methods):
Data:
In [10]: df
Out[10]:
OwnerID Value
0 1 A
1 1 B
2 1 C
3 1 D
4 2 X
5 2 Y
6 2 Z
Solution:
In [9]: pd.merge(df, df, on='OwnerID', suffixes=['','2']).query("Value != Value2")
Out[9]:
OwnerID Value Value2
1 1 A B
2 1 A C
3 1 A D
4 1 B A
6 1 B C
7 1 B D
8 1 C A
9 1 C B
11 1 C D
12 1 D A
13 1 D B
14 1 D C
17 2 X Y
18 2 X Z
19 2 Y X
21 2 Y Z
22 2 Z X
23 2 Z Y
If you need only lists:
In [17]: pd.merge(df, df, on='OwnerID', suffixes=['','2']) \
.query("Value != Value2") \
.filter(like='Value').values
Out[17]:
array([['A', 'B'],
['A', 'C'],
['A', 'D'],
['B', 'A'],
['B', 'C'],
['B', 'D'],
['C', 'A'],
['C', 'B'],
['C', 'D'],
['D', 'A'],
['D', 'B'],
['D', 'C'],
['X', 'Y'],
['X', 'Z'],
['Y', 'X'],
['Y', 'Z'],
['Z', 'X'],
['Z', 'Y']], dtype=object)
import itertools  # avoid "import itertools as iter", which shadows the builtin iter

df2 = df.groupby('OwnerID').Value.apply(lambda x: list(itertools.combinations(x, 2)))
This returns the desired output for each unique OwnerID:
OwnerID
1 [(A, B), (A, C), (A, D), (B, C), (B, D), (C, D)]
itertools is all you need.
Depending on how you want to combine them, try either permutations or combinations, for example.
try itertools
import itertools
list(itertools.combinations(['a','b','c','d'], 2))
#result: [('a', 'b'), ('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd'), ('c', 'd')]
val = df['Value'].values
length = len(val)
# range, not the Python 2 xrange; note this yields ordered pairs ([A, B] and [B, A])
pairs = [[val[i], val[j]] for i in range(length) for j in range(length) if i != j]
I'm working with a 2D list of numbers similar to the example below, and I am trying to reorder the columns:
D C B A
1 3 2 0
1 3 2 0
1 3 2 0
The first row of the list is reserved for letters to reference each column.
How can I sort this list so that these columns are placed in alphabetical order to achieve the following:
D C B A A B C D
1 3 2 0 0 2 3 1
1 3 2 0 0 2 3 1
1 3 2 0 0 2 3 1
I've found examples that make use of lambdas for sorting, but have not found any similar examples that sort columns by characters.
I'm not sure how to achieve this sorting and would really appreciate some help.
zip() the 2D list, sort by the first item, then zip() again.
>>> table = [['D', 'C', 'B', 'A',],
... [1, 3, 2, 0,],
... [1, 3, 2, 0],
... [1, 3, 2, 0]]
>>> for row in zip(*sorted(zip(*table), key=lambda x: x[0])):
... print(*row)
...
A B C D
0 2 3 1
0 2 3 1
0 2 3 1
Assume the values are stored row-by-row in a list, like this:
a = [['D', 'C', 'B', 'A'],
['1', '3', '2', '0'],
['1', '3', '2', '0']]
To sort this array you can use the following code (in Python 3, wrap the outer zip in list() to materialize it):
list(zip(*sorted(zip(*a), key=lambda column: column[0])))
where column[0] is the value to sort by (you can use column[1], etc.)
Output:
[('A', 'B', 'C', 'D'),
('0', '2', '3', '1'),
('0', '2', '3', '1')]
Note:
If you are working with pretty big arrays and execution time matters, consider using numpy; it has appropriate methods: NumPy sort
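A sketch of that numpy route. np.sort alone would sort each row independently, so the trick is to argsort the header row and apply that column order to the whole array:

```python
import numpy as np

table = np.array([['D', 'C', 'B', 'A'],
                  ['1', '3', '2', '0'],
                  ['1', '3', '2', '0']])

# column order that sorts the header row alphabetically
order = np.argsort(table[0])
sorted_table = table[:, order]
print(sorted_table)
```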