Using Series.map with a Series argument, I can take the elements of a Series and use them as indices into another Series. I want to do the same thing with some columns of a DataFrame, using each row as a set of index levels into a MultiIndex-ed Series. Here is an example:
>>> d = pandas.DataFrame([["A", 1], ["B", 2], ["C", 3]], columns=["X", "Y"])
>>> d
X Y
0 A 1
1 B 2
2 C 3
[3 rows x 2 columns]
>>> s = pandas.Series(np.arange(9), index=pandas.MultiIndex.from_product([["A", "B", "C"], [1, 2, 3]]))
>>> s
A 1 0
2 1
3 2
B 1 3
2 4
3 5
C 1 6
2 7
3 8
dtype: int32
What I would like is to be able to do d.map(s), so that each row of d is taken as a tuple to use to index into the MultiIndex of s. That is, I want the same result as this:
>>> s.ix[[("A", 1), ("B", 2), ("C", 3)]]
A 1 0
B 2 4
C 3 8
dtype: int32
However, DataFrame, unlike Series, has no map method. The other obvious alternative, s.ix[d], gives me the error "Cannot index with multidimensional key", so this is apparently not supported either.
I know I can do it by converting the DataFrame to a list of lists, or by using a row-wise apply to grab each item one by one, but isn't there any way to do it without that amount of overhead? How can I do the equivalent of Series.map on multiple columns at once?
You could create a MultiIndex from the DataFrame's values and then use ix/loc with it:
In [11]: mi = pd.MultiIndex.from_arrays(d.values.T)
In [12]: s.loc[mi] # can use ix too
Out[12]:
A 1 0
B 2 4
C 3 8
dtype: int64
This is pretty efficient:
In [21]: s = pandas.Series(np.arange(1000*1000), index=pandas.MultiIndex.from_product([range(1000), range(1000)]))
In [22]: d = pandas.DataFrame(zip(range(1000), range(1000)), columns=["X", "Y"])
In [23]: %timeit mi = pd.MultiIndex.from_arrays(d.values.T); s.loc[mi]
100 loops, best of 3: 2.77 ms per loop
In [24]: %timeit s.apply(lambda x: x + 1) # at least compared to apply
1 loops, best of 3: 3.14 s per loop
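As an aside, not part of the answer above: on pandas 0.24+, where .ix has since been removed, the MultiIndex can be built straight from the frame with MultiIndex.from_frame. A minimal sketch, assuming such a version:
mi = pd.MultiIndex.from_frame(d)   # the tuples ("A", 1), ("B", 2), ("C", 3)
s.loc[mi]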
Related
I have a DataFrame (or Series) consisting of lists, looking like this:
df = pd.DataFrame([[[1,3], [2,3,4], [1,4,2,5]]], columns=['A', 'B', 'C']).T
print(df)
Output:
0
A [1, 3]
B [2, 3, 4]
C [1, 4, 2, 5]
How can I transform it into
0
A 1
A 2
B 2
B 3
B 4
C 1
C 4
C 2
C 5
I've tried to use apply(), but that didn't quite work. Can I convert it directly somehow? I also tried to extract all the numbers as tuples [('A', 1), ('A', 3), ..] for from_records(), but I wasn't able to do that either.
I think I could do it like this:
pd.DataFrame.from_records(df[0].map(lambda x: [(0, v) for v in x]).sum())
but I don't know how to access the index. Note that (0, v) should actually be something like (x.index, v).
You need to flatten the values in the column and then repeat the index by the length of each list:
df = pd.DataFrame({0: np.concatenate(df.iloc[:, 0].values.tolist())},
                  index=df.index.repeat(df[0].str.len()))
Alternatively, with itertools.chain:
from itertools import chain
df = pd.DataFrame({0: list(chain.from_iterable(df.iloc[:, 0].values.tolist()))},
                  index=df.index.repeat(df[0].str.len()))
print (df)
0
A 1
A 3
B 2
B 3
B 4
C 1
C 4
C 2
C 5
Timings:
np.random.seed(456)
N = 100000
a = [list(range(np.random.randint(5, 20))) for _ in range(N)]
L = list('abcdefghijklmno')
df = pd.DataFrame({0:a}, index=np.random.choice(L, size=N))
print (df)
In [348]: %timeit pd.DataFrame({0:np.concatenate(df.iloc[:, 0].values.tolist())}, index=df.index.repeat(df[0].str.len()))
1 loop, best of 3: 218 ms per loop
In [349]: %timeit pd.DataFrame({0:list(chain.from_iterable(df[0].values.tolist()))}, index=df.index.repeat(df[0].str.len()))
1 loop, best of 3: 388 ms per loop
In [350]: %timeit pd.DataFrame(df.iloc[:, 0].tolist(), index=df.index).stack().reset_index(level=1, drop=1).to_frame().astype(int)
1 loop, best of 3: 384 ms per loop
Use the pd.DataFrame + stack + reset_index + to_frame:
df = pd.DataFrame(df.iloc[:, 0].tolist(), index=df.index)\
.stack().reset_index(level=1, drop=1).to_frame()
df
0
A 1.0
A 3.0
B 2.0
B 3.0
B 4.0
C 1.0
C 4.0
C 2.0
C 5.0
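As a further aside, not from either answer above: on pandas 0.25+, Series.explode performs this flattening directly, repeating the index as needed. A minimal sketch, starting again from the original df of lists; note that explode returns object dtype, so cast if you need integers:
out = df[0].explode().to_frame()   # index A, A, B, B, B, C, C, C, C
out = out.astype(int)              # optional: restore integer dtype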
import numpy as np
import pandas as pd
ind = [0, 1, 2]
cols = ['A','B','C']
df = pd.DataFrame(np.arange(9).reshape((3,3)),columns=cols)
Say you have a pandas dataframe df looking like:
A B C
0 0 1 2
1 3 4 5
2 6 7 8
If you want to capture a single element from each column in cols at a specific index ind, the output should look like a Series:
A 0
B 4
C 8
What I've tried so far was:
df.loc[ind,cols]
which gives the undesired output:
A B C
0 0 1 2
1 3 4 5
2 6 7 8
Any suggestions?
Context:
The next step would be mapping the output of a df.idxmax() call on one dataframe onto another dataframe with the same column names and indexes, but I can likely figure that out if I know how to do the above-mentioned transformation.
You can use DataFrame.lookup():
In [6]: pd.Series(df.lookup(df.index, df.columns), index=df.columns)
Out[6]:
A 0
B 4
C 8
dtype: int32
or:
In [14]: pd.Series(df.lookup(ind, cols), index=df.columns)
Out[14]:
A 0
B 4
C 8
dtype: int32
Explanation:
In [12]: df.lookup(df.index, df.columns)
Out[12]: array([0, 4, 8])
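Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in later versions; on recent pandas, an equivalent sketch using positional indexing would be:
row_pos = df.index.get_indexer(ind)      # positions of the requested row labels
col_pos = df.columns.get_indexer(cols)   # positions of the requested column labels
pd.Series(df.to_numpy()[row_pos, col_pos], index=df.columns)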
Here's a vectorized one with NumPy's advanced indexing to select one element per column, given the per-column row indices ind -
pd.Series(df.values[ind, np.arange(len(ind))], df.columns)
Sample run -
In [107]: ind = [0, 2, 1] # different one than sample for variety
...: cols = ['A','B','C']
...: df = pd.DataFrame(np.arange(9).reshape((3,3)),columns=cols)
...:
In [109]: df
Out[109]:
A B C
0 0 1 2
1 3 4 5
2 6 7 8
In [110]: pd.Series(df.values[ind, np.arange(len(ind))], df.columns)
Out[110]:
A 0
B 7
C 5
dtype: int64
Runtime test
Let's compare the proposed one against the pandas built-in vectorized lookup method from #MaxU's solution, and since we are seeing how good the vectorized ones are, let's use a greater number of columns -
In [111]: ncols = 10000
...: df = pd.DataFrame(np.random.randint(0,9,(100,ncols)))
...: ind = np.random.randint(0,100,(ncols)).tolist()
...:
# #MaxU's solution
In [112]: %timeit pd.Series(df.lookup(ind, df.columns), index=df.columns)
1000 loops, best of 3: 718 µs per loop
# Proposed in this post
In [113]: %timeit pd.Series(df.values[ind, np.arange(len(ind))], df.columns)
1000 loops, best of 3: 410 µs per loop
In [114]: ncols = 100000
...: df = pd.DataFrame(np.random.randint(0,9,(100,ncols)))
...: ind = np.random.randint(0,100,(ncols)).tolist()
...:
# #MaxU's solution
In [115]: %timeit pd.Series(df.lookup(ind, df.columns), index=df.columns)
100 loops, best of 3: 8.83 ms per loop
# Proposed in this post
In [116]: %timeit pd.Series(df.values[ind, np.arange(len(ind))], df.columns)
100 loops, best of 3: 5.76 ms per loop
There is another way using a MultiIndex, if you like using .loc:
df1=df.reset_index().melt('index').set_index(['index','variable'])
df1.loc[list(zip(df.index,df.columns))]
Out[118]:
value
index variable
0 A 0
1 B 4
2 C 8
There should be a more direct way, but this is what I could think of:
val = [df.iloc[i,i] for i in df.index]
pd.Series(val, index = df.columns)
A 0
B 4
C 8
dtype: int64
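Note that df.iloc[i, i] walks the diagonal, which only matches here because ind happens to be [0, 1, 2]; a small tweak of the same comprehension that uses the ind list from the question would be:
val = [df.iloc[i, j] for j, i in enumerate(ind)]   # take row ind[j] from column j
pd.Series(val, index=df.columns)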
You could zip the index and column values you would like to retrieve and then create a Series from that:
pd.Series([df.loc[id_, col] for id_, col in zip(ind, cols)], df.columns)
A 0
B 4
C 8
Or, if you always just need the diagonal values:
pd.Series(np.diag(df), df.columns)
This will be much faster.
I have a problem when using pandas. My task is like this:
df=pd.DataFrame([(1,2,3,4,5,6),(1,2,3,4,5,6),(1,2,3,4,5,6)],columns=['a','b','c','d','e','f'])
Out:
a b c d e f
0 1 2 3 4 5 6
1 1 2 3 4 5 6
2 1 2 3 4 5 6
What I want is an output dataframe that looks like this:
Out:
s1 s2 s3
0 3 7 11
1 3 7 11
2 3 7 11
That is to say, sum the column pairs (a,b), (c,d), (e,f) separately and rename the resulting columns (s1, s2, s3). Could anyone help solve this problem in pandas? Thank you so much.
1) Perform the groupby with respect to columns by supplying axis=1. Per #Boud's comment, you get exactly what you want with a minor tweak to the grouping array:
df.groupby((np.arange(len(df.columns)) // 2) + 1, axis=1).sum().add_prefix('s')
Grouping is performed according to this array:
np.arange(len(df.columns)) // 2
# array([0, 0, 1, 1, 2, 2], dtype=int32)
2) Use np.add.reduceat which is a faster alternative:
df = pd.DataFrame(np.add.reduceat(df.values, np.arange(len(df.columns))[::2], axis=1))
df.columns = df.columns + 1
df.add_prefix('s')
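The same reduceat result can also be built in one step with the target column names; this is just a restatement of the code above, assuming df is still the original six-column frame:
pairs = np.arange(0, len(df.columns), 2)   # start position of each column pair
out = pd.DataFrame(np.add.reduceat(df.values, pairs, axis=1),
                   columns=['s%d' % (i + 1) for i in range(len(pairs))],
                   index=df.index)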
Timings:
For a DF of 1 million rows spanned over 20 columns:
from string import ascii_lowercase
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 10, (10**6,20)), columns=list(ascii_lowercase[:20]))
df.shape
(1000000, 20)
def with_groupby(df):
    return df.groupby((np.arange(len(df.columns)) // 2) + 1, axis=1).sum().add_prefix('s')
def with_reduceat(df):
    df = pd.DataFrame(np.add.reduceat(df.values, np.arange(len(df.columns))[::2], axis=1))
    df.columns = df.columns + 1
    return df.add_prefix('s')
# test whether they give the same output
with_groupby(df).equals(with_reduceat(df))
True
%timeit with_groupby(df.copy())
1 loop, best of 3: 1.11 s per loop
%timeit with_reduceat(df.copy()) # <--- (>3X faster)
1 loop, best of 3: 345 ms per loop
I have a dataframe in Python:
CASE TYPE
1 A
1 A
1 A
2 A
2 B
3 B
3 B
3 B
How can I create a result dataframe which yields all cases and either an "A" if the case had only A's assigned, a "B" if it had only B's, or "MIXED" if the case had both A and B?
Result would be then:
Case Type
1 A
2 MIXED
3 B
Here is an option where we first collect the unique TYPE values as a list per CASE group and then check how many there are: if more than one, return MIXED, otherwise return the TYPE itself:
import pandas as pd
import numpy as np
groups = (df.groupby('CASE').agg(lambda g: [g.TYPE.unique()])
            .apply(lambda row: np.where(len(row.TYPE) > 1, 'MIXED', row.TYPE[0]), axis=1))
groups
# CASE
# 1 A
# 2 MIXED
# 3 B
# dtype: object
# number of distinct TYPEs per CASE
df['NTYPES'] = df.groupby('CASE').transform(lambda x: x.nunique())
# cases with more than one distinct TYPE become MIXED
df.loc[df.NTYPES > 1, 'TYPE'] = 'MIXED'
# keep one row per TYPE and drop the helper column
df.groupby('TYPE', as_index=False).first().drop('NTYPES', 1)
TYPE CASE
0 A 1
1 B 3
2 MIXED 2
Here is an (admittedly over-engineered) solution that avoids looping over groups and DataFrame.apply (these are slow, so avoiding them may become important if your dataset gets sufficiently large).
import pandas as pd
df = pd.DataFrame({'CASE': [1]*3 + [2]*2 + [3]*3,
'TYPE': ['A']*4 + ['B']*4})
We group by CASE and compute the relative frequencies of TYPE being A or B:
grouped = df.groupby('CASE')
vc = (grouped['TYPE'].value_counts(normalize=True)
.unstack(level=0)
.fillna(0))
Here's what vc looks like
CASE 1 2 3
TYPE
A 1.0 0.5 0.0
B 0.0 0.5 1.0
Notice that all the information is contained in the first row. Cutting said row into bins with pd.cut gives the desired result:
tolerance = 1e-10
bins = [-tolerance, tolerance, 1-tolerance, 1+tolerance]
types = pd.cut(vc.loc['A'], bins=bins, labels=['B', 'MIXED', 'A'])
We get:
CASE
1 A
2 MIXED
3 B
Name: A, dtype: category
Categories (3, object): [B < MIXED < A]
For good measure, we can rename the types series:
types.name = 'TYPE'
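If a plain two-column frame like the desired output is wanted, one extra step (not strictly part of the answer) is to reset the index:
result = types.reset_index()   # columns: CASE, TYPE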
Here is a somewhat ugly, but not that slow, solution:
In [154]: df
Out[154]:
CASE TYPE
0 1 A
1 1 A
2 1 A
3 2 A
4 2 B
5 3 B
6 3 B
7 3 B
8 4 C
9 4 C
10 4 B
In [155]: %paste
(df.groupby('CASE')['TYPE']
.apply(lambda x: x.head(1) if x.nunique() == 1 else pd.Series(['MIX']))
.reset_index()
.drop('level_1', 1)
)
## -- End pasted text --
Out[155]:
CASE TYPE
0 1 A
1 2 MIX
2 3 B
3 4 MIX
Timing against an 800K-row DF:
In [191]: df = pd.concat([df] * 10**5, ignore_index=True)
In [192]: df.shape
Out[192]: (800000, 3)
In [193]: %timeit Psidom(df)
1 loop, best of 3: 235 ms per loop
In [194]: %timeit capitalistpug(df)
1 loop, best of 3: 419 ms per loop
In [195]: %timeit Alberto_Garcia_Raboso(df)
10 loops, best of 3: 112 ms per loop
In [196]: %timeit MaxU(df)
10 loops, best of 3: 80.4 ms per loop
I work with large datasets, which makes pandas' groupby operations take a long time and/or use too much memory. I have heard some people say groupby can be slow, but I am having trouble finding a better solution.
If my dataframe has 2 columns similar to:
df = pd.DataFrame({'a':[1,2,2,4], 'b':[1,1,1,1]})
a b
1 1
2 1
2 1
4 1
For each row, I wish to get the list of b values that share that row's a value:
a b list_of_b
1 1 [1]
2 1 [1,1]
2 1 [1,1]
4 1 [1]
I currently use:
df_group = df.groupby('a')
df['list_of_b'] = df.apply(lambda row: df_group.get_group(row['a'])['b'].tolist(), axis=1)
The code above works for small stuff, but not on large dataframes ( df > 1,000,000 rows) Does anyone have a faster way to do this?
Shortest solution I can think of:
df = pd.DataFrame({'a':[1,2,2,4], 'b':[1,1,1,1]})
df.join(pd.Series(df.groupby(by='a').apply(lambda x: list(x.b)), name="list_of_b"), on='a')
a b list_of_b
0 1 1 [1]
1 2 1 [1, 1]
2 2 1 [1, 1]
3 4 1 [1]
On a 4K row df I get the following:
In [29]:
df_group = df.groupby('a')
%timeit df.apply(lambda row: df_group.get_group(row['a'])['b'].tolist(), axis=1)
%timeit df['a'].map(df.groupby('a')['b'].apply(list))
1 loops, best of 3: 4.37 s per loop
100 loops, best of 3: 4.21 ms per loop
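For reference, the map-based approach from the second %timeit line, written out as a plain assignment:
lists_of_b = df.groupby('a')['b'].apply(list)   # one list of b values per a
df['list_of_b'] = df['a'].map(lists_of_b)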
Just doing the grouping and then joining back to the original dataframe seems to be quite a bit faster:
def make_lists(df):
    g = df.groupby('a')
    def list_of_b(x):
        return x.b.tolist()
    return df.set_index('a').join(
        pd.DataFrame(g.apply(list_of_b),
                     columns=['list_of_b']),
        rsuffix='_').reset_index()
This gives me 192ms per loop with 1M rows generated like this:
df1 = pd.DataFrame({'a':[1,2,2,4], 'b':[1,1,1,1]})
low = 1
high = 10
size = 1000000
df2 = pd.DataFrame({'a':np.random.randint(low,high,size),
'b':np.random.randint(low,high,size)})
make_lists(df1)
Out[155]:
a b list_of_b
0 1 1 [1]
1 2 1 [1, 1]
2 2 1 [1, 1]
3 4 1 [1]
In [156]:
%%timeit
make_lists(df2)
10 loops, best of 3: 192 ms per loop