Suppose I have 2 DataFrames:
DataFrame 1:
A B
a 1
b 2
c 3
d 4
DataFrame 2:
C D
a c
b a
a b
The goal is to add a column to DataFrame 2 ('E').
C D E
a c (1-3=-2)
b a (2-1=1)
a b (1-2=-1)
If this were Excel, the formula would be something like "=VLOOKUP(A1,DataFrame1,2)-VLOOKUP(B1,DataFrame1,2)". Any idea what this looks like in Python?
Thanks!
A Pandas Series can be thought of as a mapping from its index to its values.
Here, we wish to use the first DataFrame, df1, as a mapping from column A to column B. So the natural thing to do is to convert df1 into a Series:
s = df1.set_index('A')['B']
# A
# a 1
# b 2
# c 3
# d 4
# Name: B, dtype: int64
Now we can use the Series.map method to "look up" each value of df2's columns in s:
import pandas as pd
df1 = pd.DataFrame({'A':list('abcd'), 'B':[1,2,3,4]})
df2 = pd.DataFrame({'C':list('aba'), 'D':list('cab')})
s = df1.set_index('A')['B']
df2['E'] = df2['C'].map(s) - df2['D'].map(s)
print(df2)
yields
C D E
0 a c -2
1 b a 1
2 a b -1
You can do something like this:
# set column A as the index, so you can look rows up by it
df1 = df1.set_index('A')
df2['E'] = df1.loc[df2.C, 'B'].values - df1.loc[df2.D, 'B'].values
And the result is:
C D E
0 a c -2
1 b a 1
2 a b -1
Hope it helps :)
Option 1
Using replace and eval with assign
df2.assign(E=df2.replace(df1.A.values, df1.B).eval('C - D'))
C D E
0 a c -2
1 b a 1
2 a b -1
I like this answer for its succinctness.
I use replace with two iterables, namely df1.A, which specifies what to replace, and df1.B, which specifies what to replace it with.
I use eval to elegantly compute the difference of the newly mapped C and D.
I use assign to create a copy of df2 with a new column named E that holds the values from the steps above.
I could have used a dictionary instead: dict(zip(df1.A, df1.B))
df2.assign(E=df2.replace(dict(zip(df1.A, df1.B))).eval('C - D'))
C D E
0 a c -2
1 b a 1
2 a b -1
Option 2 (project overkill)
numpy + pd.factorize
import numpy as np

base = df1.A.values
vals = df1.B.values
refs = df2.values.ravel()
f, u = pd.factorize(np.append(base, refs))
look = vals[f[base.size:]]
df2.assign(E=look[::2] - look[1::2])
C D E
0 a c -2
1 b a 1
2 a b -1
Timing
Among the pure pandas solutions, @unutbu's answer is the clear winner, while my overly verbose numpy solution improves on it by only about 40%.
Let's use these functions for the numpy versions. Note that using_F_order is @unutbu's contribution.
def using_numpy(df1, df2):
    base = df1.A.values
    vals = df1.B.values
    refs = df2.values.ravel()
    f, u = pd.factorize(np.append(base, refs))
    look = vals[f[base.size:]]
    return df2.assign(E=look[::2] - look[1::2])

def using_F_order(df1, df2):
    base = df1.A.values
    vals = df1.B.values
    refs = df2.values.ravel(order='F')
    f, u = pd.factorize(np.append(base, refs))
    look = vals[f[base.size:]].reshape(-1, 2, order='F')
    return df2.assign(E=look[:, 0] - look[:, 1])
small data
%timeit df2.assign(E=df2.replace(df1.A.values, df1.B).eval('C - D'))
%timeit df2.assign(E=df2.replace(dict(zip(df1.A, df1.B))).eval('C - D'))
%timeit df2.assign(E=(lambda s: df2['C'].map(s) - df2['D'].map(s))(df1.set_index('A')['B']))
%timeit using_numpy(df1, df2)
%timeit using_F_order(df1, df2)
100 loops, best of 3: 2.31 ms per loop
100 loops, best of 3: 2.44 ms per loop
1000 loops, best of 3: 1.25 ms per loop
1000 loops, best of 3: 436 µs per loop
1000 loops, best of 3: 424 µs per loop
large data
from string import ascii_lowercase, ascii_uppercase
import pandas as pd
import numpy as np
upper = np.array(list(ascii_uppercase))
lower = np.array(list(ascii_lowercase))
ch = np.core.defchararray.add(upper[:, None], lower).ravel()
np.random.seed([3,1415])
n = 100000
df1 = pd.DataFrame(dict(A=ch, B=np.arange(ch.size)))
df2 = pd.DataFrame(dict(C=np.random.choice(ch, n), D=np.random.choice(ch, n)))
%timeit df2.assign(E=df2.replace(df1.A.values, df1.B).eval('C - D'))
%timeit df2.assign(E=df2.replace(dict(zip(df1.A, df1.B))).eval('C - D'))
%timeit df2.assign(E=(lambda s: df2['C'].map(s) - df2['D'].map(s))(df1.set_index('A')['B']))
%timeit using_numpy(df1, df2)
%timeit using_F_order(df1, df2)
1 loop, best of 3: 11.1 s per loop
1 loop, best of 3: 10.6 s per loop
100 loops, best of 3: 17.7 ms per loop
100 loops, best of 3: 10.9 ms per loop
100 loops, best of 3: 9.11 ms per loop
Here's a very simple way to achieve this:
newdf = df2.replace(['a','b','c','d'],[1,2,3,4])
df2['E'] = newdf['C'] - newdf['D']
df2
I hope this helps!
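The replacement lists above are hardcoded for this toy example; if the mapping should come from DataFrame 1 instead, a minimal sketch (assuming the same df1/df2 as in the question) could build them from its columns:
import pandas as pd

df1 = pd.DataFrame({'A': list('abcd'), 'B': [1, 2, 3, 4]})
df2 = pd.DataFrame({'C': list('aba'), 'D': list('cab')})

# Build the to-replace / replace-with lists from df1 rather than hardcoding them.
newdf = df2.replace(df1['A'].tolist(), df1['B'].tolist())
df2['E'] = newdf['C'] - newdf['D']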
Related
How can I use a multidimensional grouper, in this case another DataFrame, as the grouper for a DataFrame? Can it be done in one step?
My question is essentially about how to perform the grouping under these circumstances; to make it more specific, say I then want to transform and take the sum.
Consider for example:
df1 = pd.DataFrame({'a':[1,2,3,4], 'b':[5,6,7,8]})
print(df1)
a b
0 1 5
1 2 6
2 3 7
3 4 8
df2 = pd.DataFrame({'a':['A','B','A','B'], 'b':['A','A','B','B']})
print(df2)
a b
0 A A
1 B A
2 A B
3 B B
Then, the expected output would be:
a b
0 4 11
1 6 11
2 4 15
3 6 15
Where columns a and b in df1 have been grouped by columns a and b from df2 respectively.
You will have to group each column individually since each column uses a different grouping scheme.
If you want a cleaner version, I would recommend a list comprehension over the column names, and call pd.concat on the resultant series:
pd.concat([df1[c].groupby(df2[c]).transform('sum') for c in df1.columns], axis=1)
a b
0 4 11
1 6 11
2 4 15
3 6 15
Not to say there's anything wrong with using apply as in the other answer, just that I don't like apply, so this is my suggestion :-)
Here are some timeits for your perusal. Even on just your sample data, the difference in timings is obvious.
%%timeit
(df1.stack()
    .groupby([df2.stack().index.get_level_values(level=1), df2.stack()])
    .transform('sum')
    .unstack())
%%timeit
df1.apply(lambda x: x.groupby(df2[x.name]).transform('sum'))
%%timeit
pd.concat([df1[c].groupby(df2[c]).transform('sum') for c in df1.columns], axis=1)
8.99 ms ± 4.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
8.35 ms ± 859 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
6.13 ms ± 279 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Not to say apply is slow, but explicit iteration in this case is faster. Additionally, you will notice the second and third timed solutions scale better as length grows relative to breadth, since the number of iterations depends on the number of columns.
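A rough sketch (with hypothetical sizes) of how one might check that scaling claim, using long, narrow frames so the per-column loop stays cheap:
import numpy as np
import pandas as pd

n = 100000  # many rows, only two columns
df1 = pd.DataFrame({'a': np.random.rand(n), 'b': np.random.rand(n)})
df2 = pd.DataFrame(np.random.choice(list('ABC'), (n, 2)), columns=['a', 'b'])

# %timeit pd.concat([df1[c].groupby(df2[c]).transform('sum') for c in df1.columns], axis=1)
# %timeit df1.apply(lambda x: x.groupby(df2[x.name]).transform('sum'))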
Try using apply to apply a lambda function to each column of your dataframe, then use the name of that pd.Series to pick the corresponding column of the second dataframe to group by:
df1.apply(lambda x: x.groupby(df2[x.name]).transform('sum'))
Output:
a b
0 4 11
1 6 11
2 4 15
3 6 15
Using stack and unstack
df1.stack().groupby([df2.stack().index.get_level_values(level=1),df2.stack()]).transform('sum').unstack()
Out[291]:
a b
0 4 11
1 6 11
2 4 15
3 6 15
I'm going to propose a (mostly) numpythonic solution that uses a scipy.sparse matrix to perform a vectorized groupby on the entire DataFrame at once, rather than column by column.
The key to performing this operation efficiently is finding a performant way to factorize the entire DataFrame while avoiding duplicates across columns. Since your groups are represented by strings, you can simply concatenate the column name onto the end of each value (since column names should be unique), and then factorize the result, like so [*]:
>>> df2 + df2.columns
a b
0 Aa Ab
1 Ba Ab
2 Aa Bb
3 Ba Bb
>>> pd.factorize((df2 + df2.columns).values.ravel())
(array([0, 1, 2, 1, 0, 3, 2, 3], dtype=int64),
array(['Aa', 'Ab', 'Ba', 'Bb'], dtype=object))
Once we have a unique grouping, we can use a scipy.sparse matrix to perform the groupby in a single pass over the flattened array, then use advanced indexing and a reshape to convert the result back to the original shape.
import numpy as np
import pandas as pd
from scipy import sparse

a = df1.values.ravel()
b, _ = pd.factorize((df2 + df2.columns).values.ravel())
o = sparse.csr_matrix(
    (a, b, np.arange(a.shape[0] + 1)), (a.shape[0], b.max() + 1)
).sum(0).A1
res = o[b].reshape(df1.shape)
res
array([[ 4, 11],
[ 6, 11],
[ 4, 15],
[ 6, 15]], dtype=int64)
Performance
Functions
def gp_chris(f1, f2):
    a = f1.values.ravel()
    b, _ = pd.factorize((f2 + f2.columns).values.ravel())
    o = sparse.csr_matrix(
        (a, b, np.arange(a.shape[0] + 1)), (a.shape[0], b.max() + 1)
    ).sum(0).A1
    return pd.DataFrame(o[b].reshape(f1.shape), columns=f1.columns)

def gp_cs(f1, f2):
    return pd.concat([f1[c].groupby(f2[c]).transform('sum') for c in f1.columns], axis=1)

def gp_scott(f1, f2):
    return f1.apply(lambda x: x.groupby(f2[x.name]).transform('sum'))

def gp_wen(f1, f2):
    return f1.stack().groupby([f2.stack().index.get_level_values(level=1), f2.stack()]).transform('sum').unstack()
Setup
import numpy as np
from scipy import sparse
import pandas as pd
import string
from timeit import timeit
import matplotlib.pyplot as plt
res = pd.DataFrame(
    index=[f'gp_{f}' for f in ('chris', 'cs', 'scott', 'wen')],
    columns=[10, 50, 100, 200, 400],
    dtype=float
)

for f in res.index:
    for c in res.columns:
        df1 = pd.DataFrame(np.random.rand(c, c))
        df2 = pd.DataFrame(np.random.choice(list(string.ascii_uppercase), (c, c)))
        df1.columns = df1.columns.astype(str)
        df2.columns = df2.columns.astype(str)
        stmt = '{}(df1, df2)'.format(f)
        setp = 'from __main__ import df1, df2, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=50)
ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")
plt.show()
Results
Validation
df1 = pd.DataFrame(np.random.rand(10, 10))
df2 = pd.DataFrame(np.random.choice(list(string.ascii_uppercase), (10, 10)))
df1.columns = df1.columns.astype(str)
df2.columns = df2.columns.astype(str)
v = np.stack([gp_chris(df1, df2), gp_cs(df1, df2), gp_scott(df1, df2), gp_wen(df1, df2)])
print(np.all(v[:-1] == v[1:]))
True
Either we're all wrong or we're all correct :)
[*] There is a possibility of a collision here if, before concatenation, one value already equals another value with a column name appended. If that is the case, it does not take much to fix.
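If that edge case worries you, one possible tweak (not part of the original answer) is to insert a separator that cannot appear in the data between each value and its column name before factorizing; a minimal sketch, assuming '\0' never occurs in df2:
# Hypothetical safeguard: '\0' is assumed absent from the data,
# so 'A' + '\0' + 'a' cannot collide with a literal value 'Aa'.
keys = (df2 + '\0' + df2.columns).values.ravel()
b, _ = pd.factorize(keys)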
You could do something like the following:
res = df1.assign(a_sum=lambda df: df['a'].groupby(df2['a']).transform('sum'))\
         .assign(b_sum=lambda df: df['b'].groupby(df2['b']).transform('sum'))
Results:
   a  b  a_sum  b_sum
0  1  5      4     11
1  2  6      6     11
2  3  7      4     15
3  4  8      6     15
What is the best way to do string matching on a column of lists?
E.g. I have a dataset:
import numpy as np
import pandas as pd
list_items = ['apple', 'grapple', 'tackle', 'satchel', 'snapple']
df = pd.DataFrame({'id':range(3), 'L':[np.random.choice(list_items, 3).tolist() for _ in range(3)]})
df
L id
0 [tackle, apple, grapple] 0
1 [tackle, snapple, satchel] 1
2 [satchel, satchel, tackle] 2
And I want to return the rows where any item in L matches a string, e.g. 'grap' should return row 0, and 'sat' should return rows 1:2.
Let's use this:
np.random.seed(123)
list_items = ['apple', 'grapple', 'tackle', 'satchel', 'snapple']
df = pd.DataFrame({'id':range(3), 'L':[np.random.choice(list_items, 3).tolist() for _ in range(3)]})
df
L id
0 [tackle, snapple, tackle] 0
1 [grapple, satchel, tackle] 1
2 [satchel, grapple, grapple] 2
Use any and apply:
df[df.L.apply(lambda x: any('grap' in s for s in x))]
Output:
L id
1 [grapple, satchel, tackle] 1
2 [satchel, grapple, grapple] 2
Timings:
%timeit df.L.apply(lambda x: any('grap' in s for s in x))
10000 loops, best of 3: 194 µs per loop
%timeit df.L.apply(lambda i: ','.join(i)).str.contains('grap')
1000 loops, best of 3: 481 µs per loop
%timeit df.L.str.join(', ').str.contains('grap')
1000 loops, best of 3: 529 µs per loop
df[df.L.apply(lambda i: ','.join(i)).str.contains('yourstring')]
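If you are on pandas 0.25 or later, Series.explode offers another way to phrase the match without joining strings; a rough sketch, assuming the df built above:
# Flatten the lists, test each element, then collapse back to one
# boolean per original row with groupby(level=0).any().
mask = df.L.explode().str.contains('grap').groupby(level=0).any()
df[mask]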
I have a data frame that looks like this:
mydata = [{'col_A' : 'A', 'col_B': [1,2,3]},
{'col_A' : 'B', 'col_B': [7,8]}]
pd.DataFrame(mydata)
col_A col_B
A [1, 2, 3]
B [7, 8]
How can I split the values in the lists and create a data frame that looks like this:
col_A col_B
A 1
A 2
A 3
B 7
B 8
Try this:
pd.DataFrame([{'col_A': row['col_A'], 'col_B': val}
              for ind, row in df.iterrows()
              for val in row['col_B']])
You might also be able to do something clever with the apply() function, but off the top of my head, I can't think of how.
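For what it's worth, pandas 0.25+ has a built-in for exactly this; a minimal sketch using DataFrame.explode:
import pandas as pd

mydata = [{'col_A': 'A', 'col_B': [1, 2, 3]},
          {'col_A': 'B', 'col_B': [7, 8]}]
df = pd.DataFrame(mydata)

# explode expands each list element into its own row, repeating col_A.
out = df.explode('col_B').reset_index(drop=True)
print(out)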
Here is a solution using apply:
df['col_B'].apply(pd.Series).set_index(df['col_A']).stack().reset_index(level=0)
col_A 0
0 A 1
1 A 2
2 A 3
3 B 7
4 B 8
If your DataFrame is big, the fastest approach is to use the DataFrame constructor with stack and a double reset_index:
print pd.DataFrame(x for x in df['col_B']).set_index(df['col_A']).stack() \
    .reset_index(drop=True, level=1).reset_index().rename(columns={0:'col_B'})
Testing:
import pandas as pd
mydata = [{'col_A' : 'A', 'col_B': [1,2,3]},
{'col_A' : 'B', 'col_B': [7,8]}]
df = pd.DataFrame(mydata)
print df
df = pd.concat([df]*1000).reset_index(drop=True)
print pd.DataFrame(x for x in df['col_B']).set_index(df['col_A']).stack().reset_index(drop=True, level=1).reset_index().rename(columns={0:'col_B'})
print pd.DataFrame(x for x in df['col_B']).set_index(df['col_A']).stack().reset_index().drop('level_1', axis=1).rename(columns={0:'col_B'})
print df['col_B'].apply(pd.Series).set_index(df['col_A']).stack().reset_index().drop('level_1', axis=1).rename(columns={0:'col_B'})
print pd.DataFrame([{'col_A':row['col_A'], 'col_B':val} for ind, row in df.iterrows() for val in row['col_B']])
Timing:
In [1657]: %timeit pd.DataFrame(x for x in df['col_B']).set_index(df['col_A']).stack().reset_index().drop('level_1', axis=1).rename(columns={0:'col_B'})
100 loops, best of 3: 4.01 ms per loop
In [1658]: %timeit pd.DataFrame(x for x in df['col_B']).set_index(df['col_A']).stack().reset_index(drop=True, level=1).reset_index().rename(columns={0:'col_B'})
100 loops, best of 3: 3.09 ms per loop
In [1659]: %timeit pd.DataFrame([{'col_A':row['col_A'], 'col_B':val} for ind, row in df.iterrows() for val in row['col_B']])
10 loops, best of 3: 153 ms per loop
In [1660]: %timeit df['col_B'].apply(pd.Series).set_index(df['col_A']).stack().reset_index().drop('level_1', axis=1).rename(columns={0:'col_B'})
1 loops, best of 3: 357 ms per loop
Is there a simple way to append a pandas Series to another and update its index? Currently I have two Series:
from numpy.random import randn
from pandas import Series
a = Series(randn(5))
b = Series(randn(5))
and I can append b to a with
a.append(b)
>>> 0 -0.191924
1 0.577687
2 0.332826
3 -0.975678
4 -1.536626
0 0.828700
1 0.636592
2 0.376864
3 0.890187
4 0.226657
but is there a smarter way to make sure that I have a continuous index than:
a=Series(randn(5))
b=Series(randn(5),index=range(len(a),len(a)+len(b)))
a.append(b)
One option is to use reset_index:
>>> a.append(b).reset_index(drop=True)
0 -0.370406
1 0.963356
2 -0.147239
3 -0.468802
4 0.057374
5 -1.113767
6 1.255247
7 1.207368
8 -0.460326
9 -0.685425
dtype: float64
To be fair, Roman Pekar's method is the fastest:
>>> timeit('from __main__ import np, pd, a, b; pd.Series(np.concatenate([a,b]))', number = 10000)
0.6133969540821536
>>> timeit('from __main__ import np, pd, a, b; pd.concat([a, b], ignore_index=True)', number = 10000)
1.020389742271714
>>> timeit('from __main__ import np, pd, a, b; a.append(b).reset_index(drop=True)', number = 10000)
2.2282133623128075
You can also use concat with ignore_index=True (see the docs):
pd.concat([a, b], ignore_index=True)
Edit: my tests with larger a and b:
import numpy as np
a = pd.Series(np.random.randn(100000))
b = pd.Series(np.random.randn(100000))
%timeit pd.Series(np.concatenate([a,b]))
1000 loops, best of 3: 1.05 ms per loop
%timeit pd.concat([a, b], ignore_index=True)
1000 loops, best of 3: 1.07 ms per loop
%timeit a.append(b).reset_index(drop=True)
100 loops, best of 3: 5.11 ms per loop
I think @runnerup's answer is the way to go, but you can also create a new Series explicitly:
>>> pd.Series(np.concatenate([a,b]))
0 -0.200403
1 -0.921215
2 -1.338854
3 1.093929
4 -0.879571
5 -0.810333
6 1.654075
7 0.360978
8 -0.472407
9 0.123393
dtype: float64
I want to replace negative values in a pandas DataFrame column with zero.
Is there a more concise way to construct this expression?
df['value'][df['value'] < 0] = 0
You could use the clip method:
import pandas as pd
import numpy as np
df = pd.DataFrame({'value': np.arange(-5,5)})
df['value'] = df['value'].clip(0, None)
print(df)
yields
value
0 0
1 0
2 0
3 0
4 0
5 0
6 1
7 2
8 3
9 4
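A slightly more explicit spelling of the same call, which may read better, is to pass the bound by keyword (clip accepts lower/upper keyword arguments):
# Equivalent, naming only the lower bound:
df['value'] = df['value'].clip(lower=0)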
Another possibility is numpy.maximum(). This is more straightforward to read, in my opinion.
import pandas as pd
import numpy as np
df['value'] = np.maximum(df.value, 0)
It's also significantly faster than all other methods.
df_orig = pd.DataFrame({'value': np.arange(-1000000, 1000000)})
df = df_orig.copy()
%timeit df['value'] = np.maximum(df.value, 0)
# 100 loops, best of 3: 8.36 ms per loop
df = df_orig.copy()
%timeit df['value'] = np.where(df.value < 0, 0, df.value)
# 100 loops, best of 3: 10.1 ms per loop
df = df_orig.copy()
%timeit df['value'] = df.value.clip(0, None)
# 100 loops, best of 3: 14.1 ms per loop
df = df_orig.copy()
%timeit df['value'] = df.value.clip_lower(0)
# 100 loops, best of 3: 14.2 ms per loop
df = df_orig.copy()
%timeit df.loc[df.value < 0, 'value'] = 0
# 10 loops, best of 3: 62.7 ms per loop
Here is the canonical way of doing it; while not necessarily more concise, it is more flexible (in that you can apply it to arbitrary columns):
In [39]: df = DataFrame(randn(5,1),columns=['value'])
In [40]: df
Out[40]:
value
0 0.092232
1 -0.472784
2 -1.857964
3 -0.014385
4 0.301531
In [41]: df.loc[df['value']<0,'value'] = 0
In [42]: df
Out[42]:
value
0 0.092232
1 0.000000
2 0.000000
3 0.000000
4 0.301531
Or use where to check the condition:
>>> import pandas as pd,numpy as np
>>> df = pd.DataFrame(np.random.randn(5,1),columns=['value'])
>>> df
value
0 1.193313
1 -1.011003
2 -0.399778
3 -0.736607
4 -0.629540
>>> df['value']=df['value'].where(df['value']>0,0)
>>> df
value
0 1.193313
1 0.000000
2 0.000000
3 0.000000
4 0.000000
For completeness, np.where is also a possibility, which is faster than most answers here. The np.maximum answer is the best approach though, as it's faster and more concise than this.
df['value'] = np.where(df.value < 0, 0, df.value)
Let's take only the values greater than zero, leaving the negative ones as NaN (this works with DataFrames, not with Series), then impute:
df[df > 0].fillna(0)
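A closely related spelling that works for both DataFrames and Series is mask; a minimal sketch:
# mask replaces values where the condition is True, so negatives become 0.
df.mask(df < 0, 0)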