Selecting single values from pandas dataframe using lists - python

import numpy as np
import pandas as pd
ind = [0, 1, 2]
cols = ['A','B','C']
df = pd.DataFrame(np.arange(9).reshape((3,3)),columns=cols)
Say you have a pandas dataframe df looking like:
A B C
0 0 1 2
1 3 4 5
2 6 7 8
If you want to capture a single element from each column in cols at a specific index ind the output should look like a series:
A 0
B 4
C 8
What I've tried so far was:
df.loc[ind,cols]
which gives the undesired output:
A B C
0 0 1 2
1 3 4 5
2 6 7 8
Any suggestions?
context:
The next step would be mapping the output of a df.idxmax() call on one dataframe onto another dataframe with the same column names and indexes, but I can likely figure that out once I know how to do the transformation above.

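You can use DataFrame.lookup() (note that it was deprecated in pandas 1.2 and removed in 2.0, so this only works on older versions):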
In [6]: pd.Series(df.lookup(df.index, df.columns), index=df.columns)
Out[6]:
A 0
B 4
C 8
dtype: int32
or:
In [14]: pd.Series(df.lookup(ind, cols), index=df.columns)
Out[14]:
A 0
B 4
C 8
dtype: int32
Explanation:
In [12]: df.lookup(df.index, df.columns)
Out[12]: array([0, 4, 8])
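On pandas versions where DataFrame.lookup is not available, the same label-based pairing can be approximated with NumPy indexing. This is a minimal sketch, not the original answer's method; it assumes every label in ind and cols exists in df, and the names row_pos, col_pos and result are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape((3, 3)), columns=['A', 'B', 'C'])
ind = [0, 1, 2]
cols = ['A', 'B', 'C']

# Translate row and column labels into integer positions.
# Assumption: all labels exist; get_indexer returns -1 for a missing label,
# which would silently select the last row or column.
row_pos = df.index.get_indexer(ind)
col_pos = df.columns.get_indexer(cols)

# Pair the positions element-wise with NumPy advanced indexing.
result = pd.Series(df.to_numpy()[row_pos, col_pos], index=cols)
print(result)
```

Because the lookup goes through labels rather than raw positions, it should also work when df.index is not the default 0..n-1 range.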

Here's a vectorized approach that uses NumPy's advanced indexing to select one element per column, given the row positions ind for each column -
pd.Series(df.values[ind, np.arange(len(ind))], df.columns)
Sample run -
In [107]: ind = [0, 2, 1] # different one than sample for variety
...: cols = ['A','B','C']
...: df = pd.DataFrame(np.arange(9).reshape((3,3)),columns=cols)
...:
In [109]: df
Out[109]:
A B C
0 0 1 2
1 3 4 5
2 6 7 8
In [110]: pd.Series(df.values[ind, np.arange(len(ind))], df.columns)
Out[110]:
A 0
B 7
C 5
dtype: int64
Runtime test
Let's compare the proposed approach against the pandas built-in vectorized lookup method from @MaxU's solution. Since we are measuring how well the vectorized approaches scale, let's use a larger number of columns -
In [111]: ncols = 10000
...: df = pd.DataFrame(np.random.randint(0,9,(100,ncols)))
...: ind = np.random.randint(0,100,(ncols)).tolist()
...:
# @MaxU's solution
In [112]: %timeit pd.Series(df.lookup(ind, df.columns), index=df.columns)
1000 loops, best of 3: 718 µs per loop
# Proposed in this post
In [113]: %timeit pd.Series(df.values[ind, np.arange(len(ind))], df.columns)
1000 loops, best of 3: 410 µs per loop
In [114]: ncols = 100000
...: df = pd.DataFrame(np.random.randint(0,9,(100,ncols)))
...: ind = np.random.randint(0,100,(ncols)).tolist()
...:
# @MaxU's solution
In [115]: %timeit pd.Series(df.lookup(ind, df.columns), index=df.columns)
100 loops, best of 3: 8.83 ms per loop
# Proposed in this post
In [116]: %timeit pd.Series(df.values[ind, np.arange(len(ind))], df.columns)
100 loops, best of 3: 5.76 ms per loop

There is another way using a MultiIndex, if you prefer .loc:
df1=df.reset_index().melt('index').set_index(['index','variable'])
df1.loc[list(zip(df.index,df.columns))]
Out[118]:
value
index variable
0 A 0
1 B 4
2 C 8
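A shorter route to the same (row, column) MultiIndex is DataFrame.stack. This is only a sketch, assuming df has no missing values (the classic stack implementation drops NaN entries); stacked and picked are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape((3, 3)), columns=['A', 'B', 'C'])
ind = [0, 1, 2]
cols = ['A', 'B', 'C']

# stack() turns the frame into a Series keyed by (row label, column label).
stacked = df.stack()

# Select one (row, column) pair per column with a list of tuples.
picked = stacked.loc[list(zip(ind, cols))]
print(picked)
```

The result keeps the two-level index, so it shows which row each value came from.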

There should be a more direct way, but this is what I could think of:
val = [df.iloc[i,i] for i in df.index]
pd.Series(val, index = df.columns)
A 0
B 4
C 8
dtype: int64

You could zip the column and index values you would like to retrieve the values for and then create a series from that:
pd.Series([df.loc[id_, col] for id_, col in zip(ind, cols)], df.columns)
A 0
B 4
C 8
Or if you always just need the diagonal value:
pd.Series(np.diag(df), df.columns)
This will be much faster.
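For the follow-up mentioned in the question (using the idxmax() of one dataframe to pick values from another with the same labels), the pieces can be combined roughly as follows. This is a hedged sketch: df_a, df_b and picked are hypothetical names, the sample numbers are made up for illustration, and both frames are assumed to share the same index and columns.

```python
import numpy as np
import pandas as pd

# Hypothetical frames with identical index and column labels.
df_a = pd.DataFrame({'A': [1, 5, 2], 'B': [7, 0, 3], 'C': [4, 4, 9]})
df_b = pd.DataFrame(np.arange(9).reshape((3, 3)), columns=['A', 'B', 'C'])

# Row label of the maximum in each column of df_a.
ind = df_a.idxmax()

# Convert those labels to positions in df_b and pick one value per column.
row_pos = df_b.index.get_indexer(ind)
col_pos = df_b.columns.get_indexer(ind.index)
picked = pd.Series(df_b.to_numpy()[row_pos, col_pos], index=ind.index)
print(picked)
```

Here the maxima of df_a sit in rows 1, 0 and 2, so the values taken from df_b are 3, 1 and 8.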

Related

How to sort rows values and replace them by column names on a pandas dataframe

I would like to sort the values of each row and replace the values by column names.
Suppose we have the dataframe below.
ID A B C
1 8 10 9
2 6 7 8
3 13 14 7
I want it to be converted to this form.
1 B C A
2 C B A
3 B A C
Is there a way to do it in python?
I am thinking of something like this:
df.sort(0, ascending=False)
But it does not work for me.
You can use numpy.argsort, but first move the ID column into the index with set_index:
df = df.set_index('ID')
print ((np.argsort(-df.values, axis=1)))
[[1 2 0]
[2 1 0]
[1 0 2]]
print (df.columns[np.argsort(-df.values, axis=1)])
Index([['B', 'C', 'A'], ['C', 'B', 'A'], ['B', 'A', 'C']], dtype='object')
print (pd.DataFrame(df.columns[np.argsort(-df.values, axis=1)],
index=df.index))
0 1 2
ID
1 B C A
2 C B A
3 B A C
print (pd.DataFrame(df.columns[np.argsort(-df.values, axis=1)],
index=df.index).reset_index())
ID 0 1 2
0 1 B C A
1 2 C B A
2 3 B A C
If you need to keep the column names from the original DataFrame:
print (pd.DataFrame(df.columns[np.argsort(-df.values, axis=1)],
index=df.index,
columns=df.columns))
A B C
ID
1 B C A
2 C B A
3 B A C
Timings:
#[3 rows x 3 columns]
In [97]: %timeit (pd.DataFrame(df.columns[np.argsort(-df.values, axis=1)],index=df.index, columns=df.columns))
10000 loops, best of 3: 126 µs per loop
In [98]: %timeit (df.apply(lambda row: row.sort_values(ascending=False).index, axis=1))
1000 loops, best of 3: 1.95 ms per loop
#[30000 rows x 3 columns]
df = pd.concat([df]*10000).reset_index(drop=True)
#print (df)
df = df.set_index('ID')
In [103]: %timeit (pd.DataFrame(df.columns[np.argsort(-df.values, axis=1)],index=df.index, columns=df.columns))
1000 loops, best of 3: 1.76 ms per loop
In [104]: %timeit (df.apply(lambda row: row.sort_values(ascending=False).index, axis=1))
1 loop, best of 3: 7.21 s per loop
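The argsort idea can be put together as one self-contained script. This sketch indexes a plain NumPy array of column names rather than the Index object itself, on the assumption that recent pandas versions may not accept a 2-D array as an Index indexer; order and ranked are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3],
                   'A': [8, 6, 13],
                   'B': [10, 7, 14],
                   'C': [9, 8, 7]}).set_index('ID')

# Negate the values so argsort yields descending order within each row.
order = np.argsort(-df.to_numpy(), axis=1)

# Index a NumPy array of column names with the 2-D position array.
ranked = pd.DataFrame(df.columns.to_numpy()[order], index=df.index)
print(ranked)
```

Ties between equal values are resolved by whatever order argsort happens to return, so pass kind='stable' to argsort if a deterministic tie order matters.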
The idea is to sort each row and take the resulting index.
df.apply(lambda row: row.sort_values(ascending=False).index, axis=1)
Note that when applying by row, the index of each row is the columns of the dataframe.

Pandas - count if multiple conditions

Having a dataframe in python:
CASE TYPE
1 A
1 A
1 A
2 A
2 B
3 B
3 B
3 B
How can I create a result dataframe that lists every case with "A" if the case had only A's assigned, "B" if it had only B's, or "MIXED" if it had both A and B?
Result would be then:
Case Type
1 A
2 MIXED
3 B
Here is an option: first collect the unique TYPE values for each CASE group, then check how many there are. If there is more than one, return MIXED; otherwise return the TYPE itself:
import pandas as pd
import numpy as np
# Parentheses let the method chain continue across lines
groups = (df.groupby('CASE').agg(lambda g: [g.TYPE.unique()])
          .apply(lambda row: np.where(len(row.TYPE) > 1, 'MIXED', row.TYPE[0]), axis = 1))
groups
# CASE
# 1 A
# 2 MIXED
# 3 B
# dtype: object
df['NTYPES'] = df.groupby('CASE').transform(lambda x: x.nunique())
df.loc[df.NTYPES > 1, 'TYPE'] = 'MIXED'
df.groupby('TYPE', as_index=False).first().drop('NTYPES', 1)
TYPE CASE
0 A 1
1 B 3
2 MIXED 2
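The nunique idea can also produce one row per case without a Python-level lambda. This is a minimal sketch rather than any of the posted answers; g and result are illustrative names, and the sample frame mirrors the one in the question.

```python
import pandas as pd

df = pd.DataFrame({'CASE': [1] * 3 + [2] * 2 + [3] * 3,
                   'TYPE': ['A'] * 4 + ['B'] * 4})

g = df.groupby('CASE')['TYPE']

# Keep the first TYPE of each case where the case has exactly one distinct
# TYPE; everywhere else substitute the label 'MIXED'.
result = g.first().where(g.nunique() == 1, 'MIXED')
print(result)
```

Calling result.reset_index() turns this into a two-column dataframe with CASE and TYPE.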
Here is an (admittedly over-engineered) solution that avoids looping over groups and DataFrame.apply (these are slow, so avoiding them may become important if your dataset gets sufficiently large).
import pandas as pd
df = pd.DataFrame({'CASE': [1]*3 + [2]*2 + [3]*3,
'TYPE': ['A']*4 + ['B']*4})
We group by CASE and compute the relative frequencies of TYPE being A or B:
grouped = df.groupby('CASE')
vc = (grouped['TYPE'].value_counts(normalize=True)
.unstack(level=0)
.fillna(0))
Here's what vc looks like
CASE 1 2 3
TYPE
A 1.0 0.5 0.0
B 0.0 0.5 1.0
Notice that all the information is contained in the first row. Cutting said row into bins with pd.cut gives the desired result:
tolerance = 1e-10
bins = [-tolerance, tolerance, 1-tolerance, 1+tolerance]
types = pd.cut(vc.loc['A'], bins=bins, labels=['B', 'MIXED', 'A'])
We get:
CASE
1 A
2 MIXED
3 B
Name: A, dtype: category
Categories (3, object): [B < MIXED < A]
For good measure, we can rename the types series:
types.name = 'TYPE'
Here is a somewhat ugly, but not that slow, solution:
In [154]: df
Out[154]:
CASE TYPE
0 1 A
1 1 A
2 1 A
3 2 A
4 2 B
5 3 B
6 3 B
7 3 B
8 4 C
9 4 C
10 4 B
In [155]: %paste
(df.groupby('CASE')['TYPE']
.apply(lambda x: x.head(1) if x.nunique() == 1 else pd.Series(['MIX']))
.reset_index()
.drop('level_1', 1)
)
## -- End pasted text --
Out[155]:
CASE TYPE
0 1 A
1 2 MIX
2 3 B
3 4 MIX
Timing against an 800K-row DataFrame:
In [191]: df = pd.concat([df] * 10**5, ignore_index=True)
In [192]: df.shape
Out[192]: (800000, 3)
In [193]: %timeit Psidom(df)
1 loop, best of 3: 235 ms per loop
In [194]: %timeit capitalistpug(df)
1 loop, best of 3: 419 ms per loop
In [195]: %timeit Alberto_Garcia_Raboso(df)
10 loops, best of 3: 112 ms per loop
In [196]: %timeit MaxU(df)
10 loops, best of 3: 80.4 ms per loop

Creating new column in pandas dataframe with a list of values from another column without using "groupby"

I work with large datasets, making pandas group and groupby functions take a long time/use too much memory. I have heard some people say groupby can be slow, but am having trouble finding a better solution.
If my dataframe has 2 columns similar to:
df = pd.DataFrame({'a':[1,2,2,4], 'b':[1,1,1,1]})
a b
1 1
2 1
2 1
4 1
I wish to return a list of values that match to a value in another column:
a b list_of_b
1 1 [1]
2 1 [1,1]
2 1 [1,1]
4 1 [1]
I currently use:
df_group = df.groupby('a')
df['list_of_b'] = df.apply(lambda row: df_group.get_group(row['a'])['b'].tolist(), axis=1)
The code above works for small inputs, but not on large dataframes (more than 1,000,000 rows). Does anyone have a faster way to do this?
Shortest solution I can think of:
df = pd.DataFrame({'a':[1,2,2,4], 'b':[1,1,1,1]})
df.join(pd.Series(df.groupby(by='a').apply(lambda x: list(x.b)), name="list_of_b"), on='a')
a b list_of_b
0 1 1 [1]
1 2 1 [1, 1]
2 2 1 [1, 1]
3 4 1 [1]
On a 4K row df I get the following:
In [29]:
df_group = df.groupby('a')
%timeit df.apply(lambda row: df_group.get_group(row['a'])['b'].tolist(), axis=1)
%timeit df['a'].map(df.groupby('a')['b'].apply(list))
1 loops, best of 3: 4.37 s per loop
100 loops, best of 3: 4.21 ms per loop
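The faster of the two timed statements builds the lists once per group and then maps them back onto column a. Spelled out as a small script (a sketch; lists is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 4], 'b': [1, 1, 1, 1]})

# One list of b values per distinct value of a.
lists = df.groupby('a')['b'].apply(list)

# Look up each row's value of a in that Series.
df['list_of_b'] = df['a'].map(lists)
print(df)
```

Note that rows sharing the same value of a end up referencing the same list object, so mutating one of them in place would affect the others.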
Just doing the grouping and then joining back to the original dataframe seems to be quite a bit faster:
def make_lists(df):
g = df.groupby('a')
def list_of_b(x):
return x.b.tolist()
return df.set_index('a').join(
pd.DataFrame(g.apply(list_of_b),
columns=['list_of_b']),
rsuffix='_').reset_index()
This gives me 192ms per loop with 1M rows generated like this:
df1 = pd.DataFrame({'a':[1,2,2,4], 'b':[1,1,1,1]})
low = 1
high = 10
size = 1000000
df2 = pd.DataFrame({'a':np.random.randint(low,high,size),
'b':np.random.randint(low,high,size)})
make_lists(df1)
Out[155]:
a b list_of_b
0 1 1 [1]
1 2 1 [1, 1]
2 2 1 [1, 1]
3 4 1 [1]
In [156]:
%%timeit
make_lists(df2)
10 loops, best of 3: 192 ms per loop

python pandas clean up empty rows after last row of data

I have a df like this:
t1 t2 t3
0 a b c
1 b
2
3
4 a b c
5 b
6
7
I want to drop all rows after index 5 because they have no values, but keep the empty rows at index 2 and 3. I will not know in advance whether each column will have data.
All values are strings.
In [74]: df.iloc[:np.where(df.any(axis=1))[0][-1]+1]
Out[74]:
t1 t2 t3
10 a b c
11 b
12
13
14 a b c
15 b
Explanation: First find which rows contain something other than empty strings:
In [37]: df.any(axis=1)
Out[37]:
0 True
1 True
2 False
3 False
4 True
5 True
6 False
7 False
dtype: bool
Find the location of the rows which are True:
In [71]: np.where(df.any(axis=1))
Out[71]: (array([0, 1, 4, 5]),)
Find the largest index (which will also be the last):
In [72]: np.where(df.any(axis=1))[0][-1]
Out[72]: 5
Then you can use df.iloc to select all rows up to and including position 5.
Note that the first method I suggested (the df.loc version timed below) is not as robust; if your dataframe has an index with repeated values, then selecting the rows with df.loc is problematic.
The new method is also a bit faster:
In [75]: %timeit df.iloc[:np.where(df.any(axis=1))[0][-1]+1]
1000 loops, best of 3: 203 µs per loop
In [76]: %timeit df.loc[:df.any(axis=1).cumsum().argmax()]
1000 loops, best of 3: 296 µs per loop
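The steps above can be collected into one runnable sketch. The sample frame is an assumption: the question does not show which column holds the lone b in rows 1 and 5, so it is placed in t2 here. The comparison with '' is also made explicit instead of relying on the truthiness of strings, and has_data, last and trimmed are illustrative names.

```python
import numpy as np
import pandas as pd

# Assumed layout of the question's data; empty cells are empty strings.
df = pd.DataFrame({'t1': ['a', '', '', '', 'a', '', '', ''],
                   't2': ['b', 'b', '', '', 'b', 'b', '', ''],
                   't3': ['c', '', '', '', 'c', '', '', '']})

# True for every row that has at least one non-empty cell.
has_data = df.ne('').any(axis=1).to_numpy()

# Position of the last row with data, then slice positionally up to it.
last = np.flatnonzero(has_data)[-1]
trimmed = df.iloc[:last + 1]
print(trimmed)
```

If no row contains data at all, np.flatnonzero returns an empty array and the [-1] lookup raises an IndexError, so that case needs its own check.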

Equivalent of Series.map for DataFrame?

Using Series.map with a Series argument, I can take the elements of a Series and use them as indices into another Series. I want to do the same thing with some columns of a DataFrame, using each row as a set of index levels into a MultiIndex-ed Series. Here is an example:
>>> d = pandas.DataFrame([["A", 1], ["B", 2], ["C", 3]], columns=["X", "Y"])
>>> d
X Y
0 A 1
1 B 2
2 C 3
[3 rows x 2 columns]
>>> s = pandas.Series(np.arange(9), index=pandas.MultiIndex.from_product([["A", "B", "C"], [1, 2, 3]]))
>>> s
A 1 0
2 1
3 2
B 1 3
2 4
3 5
C 1 6
2 7
3 8
dtype: int32
What I would like is to be able to do d.map(s), so that each row of d should be taken as a tuple to use to index into the MultiIndex of s. That is, I want the same result as this:
>>> s.ix[[("A", 1), ("B", 2), ("C", 3)]]
A 1 0
B 2 4
C 3 8
dtype: int32
However, DataFrame, unlike Series, has no map method. The other obvious alternative, s.ix[d], gives me the error "Cannot index with multidimensional key", so this is apparently not supported either.
I know I can do it by converting the DataFrame to a list of lists, or by using a row-wise apply to grab each item one by one, but isn't there any way to do it without that amount of overhead? How can I do the equivalent of Series.map on multiple columns at once?
You could create a MultiIndex from the DataFrame and then index with ix/loc using that:
In [11]: mi = pd.MultiIndex.from_arrays(d.values.T)
In [12]: s.loc[mi] # can use ix too
Out[12]:
A 1 0
B 2 4
C 3 8
dtype: int64
This is pretty efficient:
In [21]: s = pandas.Series(np.arange(1000*1000), index=pandas.MultiIndex.from_product([range(1000), range(1000)]))
In [22]: d = pandas.DataFrame(zip(range(1000), range(1000)), columns=["X", "Y"])
In [23]: %timeit mi = pd.MultiIndex.from_arrays(d.values.T); s.loc[mi]
100 loops, best of 3: 2.77 ms per loop
In [24]: %timeit s.apply(lambda x: x + 1) # at least compared to apply
1 loops, best of 3: 3.14 s per loop
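A variation on the same idea, offered as a sketch rather than as the answer's own code: MultiIndex.from_frame builds the index directly from the columns of d and keeps each column's dtype, and reindex is swapped in for loc so that a key missing from s yields NaN instead of raising. The name result is illustrative.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame([["A", 1], ["B", 2], ["C", 3]], columns=["X", "Y"])
s = pd.Series(np.arange(9),
              index=pd.MultiIndex.from_product([["A", "B", "C"], [1, 2, 3]]))

# Each row of d becomes one (X, Y) key of the MultiIndex.
mi = pd.MultiIndex.from_frame(d)

# Look up every key in s; the result is indexed by those keys.
result = s.reindex(mi)
print(result)
```

To attach the looked-up values back onto d as a new column, result.to_numpy() can be assigned directly, since the row order is preserved.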
