I have a data frame with many binary columns, each indicating the presence of a category in the observation. Every observation has exactly 3 categories with a value of 1; the rest are 0. I want to create 3 new columns, one for each category, whose values are the names of the categories (i.e. the names of the binary columns) that equal 1. To make it clearer:
I have:
x|y|z|k|w
---------
0|1|1|0|1
To be:
cat1|cat2|cat3
--------------
y |z |w
Can I do this?
For better performance, use a NumPy-based solution:
print (df)
x y z k w
0 0 1 1 0 1
1 1 1 0 0 1
c = df.columns.values
df = pd.DataFrame(c[np.where(df)[1].reshape(-1, 3)]).add_prefix('cat')
print (df)
cat0 cat1 cat2
0 y z w
1 x y w
Details:
#get indices of 1s
print (np.where(df))
(array([0, 0, 0, 1, 1, 1], dtype=int64), array([1, 2, 4, 0, 1, 4], dtype=int64))
#select the second array
print (np.where(df)[1])
[1 2 4 0 1 4]
#reshape to 3 columns
print (np.where(df)[1].reshape(-1, 3))
[[1 2 4]
[0 1 4]]
#indexing
print (c[np.where(df)[1].reshape(-1, 3)])
[['y' 'z' 'w']
['x' 'y' 'w']]
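Note that the reshape(-1, 3) step relies on every row containing exactly three 1s, as stated in the question. A minimal sanity check you could run on the original binary df before overwriting it (a hedged addition, not part of the original answer):
#each row must contain exactly three 1s, otherwise reshape(-1, 3)
#silently misaligns category names across rows
assert df.sum(axis=1).eq(3).all()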
Timings:
df = pd.concat([df] * 1000, ignore_index=True)
#jezrael solution
In [390]: %timeit (pd.DataFrame(df.columns.values[np.where(df)[1].reshape(-1, 3)]).add_prefix('cat'))
The slowest run took 4.62 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 503 µs per loop
#jpp solution
In [391]: %timeit (pd.DataFrame(df.apply(lambda row: [x for x in df.columns if row[x]], axis=1).values.tolist()))
10 loops, best of 3: 111 ms per loop
#Zero's solution works only with a one-row DataFrame, so it is not included
Here is one way:
import pandas as pd
df = pd.DataFrame({'x': [0, 1], 'y': [1, 1], 'z': [1, 0], 'k': [0, 1], 'w': [1, 1]})
split = df.apply(lambda row: [x for x in df.columns if row[x]], axis=1).values.tolist()
df2 = pd.DataFrame(split)
# 0 1 2 3
# 0 w y z None
# 1 k w x y
You could use:
In [13]: pd.DataFrame([df.columns[df.astype(bool).values[0]]]).add_prefix('cat')
Out[13]:
cat0 cat1 cat2
0 y z w
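The same boolean-mask idea generalizes to every row; a minimal sketch (assuming, as in the question, each row has exactly three 1s so the columns line up; the name out is just illustrative):
import pandas as pd

#same data as in the NumPy answer above
df = pd.DataFrame({'x': [0, 1], 'y': [1, 1], 'z': [1, 0], 'k': [0, 0], 'w': [1, 1]})

#for each row, index the column labels with that row's boolean mask
out = pd.DataFrame(
    [df.columns[row].tolist() for row in df.astype(bool).values]
).add_prefix('cat')
print(out)
#  cat0 cat1 cat2
#0    y    z    w
#1    x    y    w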
Related
I have three columns: id (non-unique id), X (categories) and Y (categories). (I don't have a dataset to share yet. I'll try to replicate what I have using a smaller dataset and edit as soon as possible)
I ran a for loop on a very small subset and based on those results it might take over 4 hours to run this code. I'm looking for a faster way to do this task using pandas (maybe using iterrows, like iterating over previous rows within apply)
For each row I check
whether the current X matches any of previous Xs (check_X = X[:row] == X[row])
whether the current Y matches any of previous Ys (check_Y = Y[:row] == Y[row])
whether the current id does not match any of previous ids (check_id = id[:row] != id[row])
if sum(check_X & check_Y & check_id)>0: then append 1 to the array
else: append 0
You are probably looking for duplicated:
df = pd.DataFrame({'id': [0, 0, 0, 1, 0],
'X': [1, 1, 2, 1, 1],
'Y': [2, 2, 2, 2, 2]})
df['dup'] = ~df[df.duplicated(['X', 'Y'])].duplicated('id', keep=False).loc[lambda x: ~x]
df['dup'] = df['dup'].fillna(False).astype(int)
print(df)
# Output
id X Y dup
0 0 1 2 0
1 0 1 2 0
2 0 2 2 0
3 1 1 2 1
4 0 1 2 0
Update
X and Y should be checked separately:
import numpy as np

df = pd.DataFrame({'id': [0, 1, 1, 2, 2, 3, 4],
'X': [0, 1, 1, 1, 1, 1, 1],
'Y': [0, 2, 2, 2, 2, 2, 2]})
df['dup'] = np.where(df['X'].duplicated() & df['Y'].duplicated() & ~df['id'].duplicated(), 1, 0)
print(df)
# Output
id X Y dup
0 0 0 0 0
1 1 1 2 0
2 1 1 2 0
3 2 1 2 1
4 2 1 2 0
5 3 1 2 1
6 4 1 2 1
EDIT: the answer from @Corralien using duplicated() will likely be much faster and is the best answer for this specific problem. However, apply is more flexible if you have different things to check.
You could do it with iterrows() or apply(). As far as I know, apply() is faster:
check_id, check_x, check_y = set(), set(), set()

def apply_func(row):
    global check_id, check_x, check_y
    # Mark the row if X and Y have been seen before but the id has not.
    if row['id'] not in check_id and row['X'] in check_x and row['Y'] in check_y:
        row['duplicate'] = 1
    else:
        row['duplicate'] = 0
    check_id.add(row['id'])
    check_x.add(row['X'])
    check_y.add(row['Y'])
    return row

df = df.apply(apply_func, axis=1)
With iterrows():
check_id, check_x, check_y = set(), set(), set()

for i, row in df.iterrows():
    if row['id'] not in check_id and row['X'] in check_x and row['Y'] in check_y:
        df.loc[i, 'duplicate'] = 1
    else:
        df.loc[i, 'duplicate'] = 0
    check_id.add(row['id'])
    check_x.add(row['X'])
    check_y.add(row['Y'])
Trying to figure out a way to slice non-contiguous, unequal-length row ranges of a pandas/NumPy matrix, column by column, so I can set the values to a common value. Has anyone come across an elegant solution for this?
import numpy as np
import pandas as pd
x = pd.DataFrame(np.arange(12).reshape(3,4))
#x is the matrix we want to index into
"""
x before:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
"""
y = pd.DataFrame([[0,3],[2,2],[1,2],[0,0]])
#y is a matrix where each row contains an inclusive start and end row index for one column of x
"""
   0  1
0  0  3
1  2  2
2  1  2
3  0  0
"""
What I'm looking for is a way to effectively select different length slices of x based on the rows of y
x[y] = 0
"""
x afterwards:
array([[ 0,  1,  2,  0],
       [ 0,  5,  0,  7],
       [ 0,  0,  0, 11]])
"""
Masking can still be useful here: even if a loop cannot be entirely avoided, the main DataFrame x does not need to be involved in the loop, so this should speed things up:
mask = np.zeros_like(x, dtype=bool)
for i in range(len(y)):
    mask[y.iloc[i, 0]:(y.iloc[i, 1] + 1), i] = True
x[mask] = 0
x
0 1 2 3
0 0 1 2 0
1 0 5 0 7
2 0 0 0 11
As a further improvement, consider defining y as a NumPy array if possible.
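For example, a minimal sketch of the same masking loop with y converted to a NumPy array first (the name y_arr is just illustrative; assumes x and y as defined in the question):
import numpy as np

y_arr = y.to_numpy()                     # one (start, end) pair per column of x
mask = np.zeros(x.shape, dtype=bool)
for col, (start, end) in enumerate(y_arr):
    mask[start:end + 1, col] = True      # end index is inclusive, as above
x[mask] = 0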
I customized this answer to your problem:
y_t = y.values.transpose()
y_t[1,:] = y_t[1,:] - 1  # or remove this line and change '>= r' below to '> r'
r = np.arange(x.shape[0])
mask = ((y_t[0,:,None] <= r) & (y_t[1,:,None] >= r)).transpose()
res = x.where(~mask, 0)
res
# 0 1 2 3
# 0 0 1 2 0
# 1 0 5 0 7
# 2 0 0 0 11
I'm trying to mark (in column ok) all groups in a pandas DataFrame that are smaller than N. I have a working solution, but it's slow. Is there a way to speed this up?
import pandas as pd
df = pd.DataFrame([
[1, 2, 1],
[1, 2, 2],
[1, 2, 3],
[2, 3, 1],
[2, 3, 2],
[4, 5, 1],
[4, 5, 2],
[4, 5, 3],
], columns=['x', 'y', 'z'])
keys = ['x', 'y']
N = 3
df['ok'] = True
c = df.groupby(keys)['ok'].count()
for vals in c[c < N].index:
    local_dict = dict(zip(keys, vals))
    query = ' & '.join(f'{key} == @{key}' for key in keys)
    idx = df.query(query, local_dict=local_dict).index
    df.loc[idx, 'ok'] = False
print(df)
Instead of using groupby/count, use groupby/transform/count to form a Series which is the same length as the original DataFrame df:
c = df.groupby(keys)['z'].transform('count')
Then you can form a boolean mask which has the same length as df:
In [35]: c<N
Out[35]:
0 False
1 False
2 False
3 True
4 True
5 False
6 False
7 False
Name: z, dtype: bool
Assignment to ok goes much more smoothly now, without a loop, querying or sub-indexing:
df['ok'] = c >= N
import pandas as pd
df = pd.DataFrame([
[1, 2, 1],
[1, 2, 2],
[1, 2, 3],
[2, 3, 1],
[2, 3, 2],
[4, 5, 1],
[4, 5, 2],
[4, 5, 3],
], columns=['x', 'y', 'z'])
keys = ['x', 'y']
N = 3
c = df.groupby(keys)['z'].transform('count')
df['ok'] = c >= N
print(df)
yields
x y z ok
0 1 2 1 True
1 1 2 2 True
2 1 2 3 True
3 2 3 1 False
4 2 3 2 False
5 4 5 1 True
6 4 5 2 True
7 4 5 3 True
Since the builtin groupby/transform methods (such as transform('count')) are
Cythonized, they are in general faster than calling groupby/transform
with a custom lambda function.
Thus, computing the ok column in two steps using
c = df.groupby(keys)['z'].transform('count')
df['ok'] = c >= N
is faster than
df.assign(ok=df.groupby(keys)['z'].transform(lambda x: x.size >= N))
In addition, vectorized operations over an entire column (such as c >= N) are
faster than multiple operations over subgroups. transform(lambda x: x.size >= N)
performs the comparison x.size >= N once for each group. If there are
many groups, then computing c >= N instead yields an improvement in performance.
For example, with this 1000-row DataFrame:
import numpy as np
import pandas as pd
np.random.seed(2017)
df = pd.DataFrame(np.random.randint(10, size=(1000, 3)), columns=['x', 'y', 'z'])
keys = ['x', 'y']
N = 3
using transform('count') is about 12x faster:
In [37]: %%timeit
....: c = df.groupby(keys)['z'].transform('count')
....: df['ok'] = c >= N
1000 loops, best of 3: 1.69 ms per loop
In [38]: %timeit df.assign(ok=df.groupby(keys)['z'].transform(lambda x: x.size >= N))
1 loop, best of 3: 20.2 ms per loop
In [39]: 20.2/1.69
Out[39]: 11.95266272189349
In the example above there were 100 groups:
In [47]: df.groupby(keys).ngroups
Out[47]: 100
The speed advantage of using transform('count') increases as the number of
groups increases. For example, with 955 groups:
In [48]: np.random.seed(2017); df = pd.DataFrame(np.random.randint(100, size=(1000, 3)), columns=['x', 'y', 'z'])
In [51]: df.groupby(keys).ngroups
Out[51]: 955
the transform('count') method performs about 92x faster:
In [49]: %%timeit
....: c = df.groupby(keys)['z'].transform('count')
....: df['ok'] = c >= N
1000 loops, best of 3: 1.88 ms per loop
In [50]: %timeit df.assign(ok=df.groupby(keys)['z'].transform(lambda x: x.size >= N))
10 loops, best of 3: 173 ms per loop
In [52]: 173/1.88
Out[52]: 92.02127659574468
Input variables:
keys = ['x','y']
N = 3
Calculate ok with groupby, transform and size:
df.assign(ok=df.groupby(keys)['z'].transform(lambda x: x.size >= N))
Output:
x y z ok
0 1 2 1 True
1 1 2 2 True
2 1 2 3 True
3 2 3 1 False
4 2 3 2 False
5 4 5 1 True
6 4 5 2 True
7 4 5 3 True
How can I aggregate to get, for each row, the average of b within its a group while excluding the current row (the target result is shown in c)?
a b c
1 1 0.5 # (avg of 0 & 1, excluding 1)
1 1 0.5 # (avg of 0 & 1, excluding 1)
1 0 1 # (avg of 1 & 1, excluding 0)
2 1 0.5 # (avg of 0 & 1, excluding 1)
2 0 1 # (avg of 1 & 1, excluding 0)
2 1 0.5 # (avg of 0 & 1, excluding 1)
3 1 0.5 # (avg of 0 & 1, excluding 1)
3 0 1 # (avg of 1 & 1, excluding 0)
3 1 0.5 # (avg of 0 & 1, excluding 1)
Data dump:
import pandas as pd
data = pd.DataFrame([[1, 1, 0.5], [1, 1, 0.5], [1, 0, 1], [2, 1, 0.5], [2, 0, 1],
[2, 1, 0.5], [3, 1, 0.5], [3, 0, 1], [3, 1, 0.5]],
columns=['a', 'b', 'c'])
Suppose a group has values x_1, ..., x_n.
The average of the entire group would be
m = (x_1 + ... + x_n)/n
The sum of the group without x_i would be
(m*n - x_i)
The average of the group without x_i would be
(m*n - x_i)/(n-1)
Therefore, you could compute the desired column of values with
import pandas as pd
df = pd.DataFrame([[1, 1, 0.5], [1, 1, 0.5], [1, 0, 1], [2, 1, 0.5], [2, 0, 1],
[2, 1, 0.5], [3, 1, 0.5], [3, 0, 1], [3, 1, 0.5]],
columns=['a', 'b', 'c'])
grouped = df.groupby(['a'])
n = grouped['b'].transform('count')
mean = grouped['b'].transform('mean')
df['result'] = (mean*n - df['b'])/(n-1)
which yields
In [32]: df
Out[32]:
a b c result
0 1 1 0.5 0.5
1 1 1 0.5 0.5
2 1 0 1.0 1.0
3 2 1 0.5 0.5
4 2 0 1.0 1.0
5 2 1 0.5 0.5
6 3 1 0.5 0.5
7 3 0 1.0 1.0
8 3 1 0.5 0.5
In [33]: assert df['result'].equals(df['c'])
Per the comments below, in the OP's actual use case, the DataFrame's a column
contains strings:
import numpy as np

def make_random_str_array(letters, strlen, size):
    return (np.random.choice(list(letters), size*strlen)
              .view('|S{}'.format(strlen)))
N = 3*10**6
df = pd.DataFrame({'a':make_random_str_array(letters='ABCD', strlen=10, size=N),
'b':np.random.randint(10, size=N)})
so that there are about a million unique values in df['a'] out of 3 million
total:
In [87]: uniq, key = np.unique(df['a'], return_inverse=True)
In [88]: len(uniq)
Out[88]: 988337
In [89]: len(df)
Out[89]: 3000000
In this case the calculation above requires (on my machine) about 11 seconds:
In [86]: %%timeit
   ....: grouped = df.groupby(['a'])
   ....: n = grouped['b'].transform('count')
   ....: mean = grouped['b'].transform('mean')
   ....: df['result'] = (mean*n - df['b'])/(n-1)
   ....:
1 loops, best of 3: 10.5 s per loop
Pandas converts all string-valued columns to object dtype. But we could convert the
DataFrame column to a NumPy array with a fixed-width string dtype, and then group
according to those values.
Here is a benchmark showing that if we convert the Series with object dtype to a NumPy array with fixed-width string dtype, the calculation requires less than 2 seconds:
In [97]: %%timeit
   ....: grouped = df.groupby(df['a'].values.astype('|S4'))
   ....: n = grouped['b'].transform('count')
   ....: mean = grouped['b'].transform('mean')
   ....: df['result'] = (mean*n - df['b'])/(n-1)
   ....:
1 loops, best of 3: 1.39 s per loop
Beware that you need to know the maximum length of the strings in df['a'] to choose the appropriate fixed-width dtype. In the example above, all the strings have length 4, so |S4 works. If you use |Sn for some integer n and n is smaller than the longest string, then those strings get silently truncated with no error or warning. This could lead to the grouping of values which should not be grouped together. Thus, the onus is on you to choose the correct fixed-width dtype.
You could use
dtype = '|S{}'.format(df['a'].str.len().max())
grouped = df.groupby(df['a'].values.astype(dtype))
to ensure the conversion uses the correct dtype.
You can calculate the statistics manually by iterating group by group:
# Set up input
import pandas as pd
df = pd.DataFrame([
[1, 1, 0.5], [1, 1, 0.5], [1, 0, 1],
[2, 1, 0.5], [2, 0, 1], [2, 1, 0.5],
[3, 1, 0.5], [3, 0, 1], [3, 1, 0.5]
], columns=['a', 'b', 'c'])
df
a b c
0 1 1 0.5
1 1 1 0.5
2 1 0 1.0
3 2 1 0.5
4 2 0 1.0
5 2 1 0.5
6 3 1 0.5
7 3 0 1.0
8 3 1 0.5
# Perform grouping, excluding the current row
results = []
grouped = df.groupby(['a'])
for key, group in grouped:
    for idx, row in group.iterrows():
        # The group excluding current row
        group_other = group.drop(idx)
        avg = group_other['b'].mean()
        results.append(row.tolist() + [avg])
# Compare our results with what is expected
results_df = pd.DataFrame(
    results, columns=['a', 'b', 'c', 'c_new']
)
results_df
a b c c_new
0 1 1 0.5 0.5
1 1 1 0.5 0.5
2 1 0 1.0 1.0
3 2 1 0.5 0.5
4 2 0 1.0 1.0
5 2 1 0.5 0.5
6 3 1 0.5 0.5
7 3 0 1.0 1.0
8 3 1 0.5 0.5
This way you can use any statistic you want.
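For instance, a hedged variant of the loop above that swaps the mean for the median (assumes df is the DataFrame defined above; the column name median_excl is just an illustrative choice):
results = []
for key, group in df.groupby(['a']):
    for idx, row in group.iterrows():
        group_other = group.drop(idx)  # the group without the current row
        results.append(row.tolist() + [group_other['b'].median()])

median_df = pd.DataFrame(results, columns=['a', 'b', 'c', 'median_excl'])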
I have a dataframe with categorical attributes where the index contains duplicates. I am trying to find the sum of each possible combination of index and attribute.
x = pd.DataFrame({'x':[1,1,3,3],'y':[3,3,5,5]},index=[11,11,12,12])
y = x.stack()
print(y)
print(y.groupby(level=[0,1]).sum())
output
11 x 1
y 3
x 1
y 3
12 x 3
y 5
x 3
y 5
dtype: int64
11 x 1
y 3
x 1
y 3
12 x 3
y 5
x 3
y 5
dtype: int64
The stacked output and the groupby/sum output are exactly the same.
However, the one I expect is
11 x 2
11 y 6
12 x 6
12 y 10
EDIT 2:
x = pd.DataFrame({'x':[1,1,3,3],'y':[3,3,5,5]},index=[11,11,12,12])
y = x.stack().groupby(level=[0,1]).sum()
print(y.groupby(level=[0,1]).sum())
output:
11 x 1
y 3
x 1
y 3
12 x 3
y 5
x 3
y 5
dtype: int64
EDIT 3:
An issue has been logged:
https://github.com/pydata/pandas/issues/10417
With pandas 0.16.2 and Python 3, I was able to get the correct result via:
x.stack().reset_index().groupby(['level_0','level_1']).sum()
Which produces:
0
level_0 level_1
11 x 2
y 6
12 x 6
y 10
You can then change the index and column names to more desirable ones, for example with rename() or by assigning to index and columns.
Based on my research, I agree that the failure of the original approach appears to be a bug. I think the bug is on Series, which is what x.stack() produces. My workaround is to turn the Series into a DataFrame via reset_index(). In this case the DataFrame does not have a MultiIndex anymore - I'm just grouping on labeled columns.
To make sure that grouping and summing works on a DataFrame with a MultiIndex, you can try this to get the same correct output:
x.stack().reset_index().set_index(['level_0','level_1'],drop=True).\
groupby(level=[0,1]).sum()
Either of these workarounds should take care of things until the bug is resolved.
I wonder if the bug has something to do with the MultiIndex instances that are created on a Series vs. a DataFrame. For example:
In[1]: obj = x.stack()
type(obj)
Out[1]: pandas.core.series.Series
In[2]: obj.index
Out[2]: MultiIndex(levels=[[11, 11, 12, 12], ['x', 'y']],
labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]])
vs.
In[3]: obj = x.stack().reset_index().set_index(['level_0','level_1'],drop=True)
type(obj)
Out[3]: pandas.core.frame.DataFrame
In[4]: obj.index
Out[4]: MultiIndex(levels=[[11, 12], ['x', 'y']],
labels=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 1, 0, 1, 0, 1, 0, 1]],
names=['level_0', 'level_1'])
Notice how the MultiIndex on the DataFrame describes the levels more correctly.
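Along the same lines, a hedged sketch that stays with the Series but rebuilds its MultiIndex from the underlying tuples so the levels are deduplicated before grouping. On modern pandas the plain groupby already aggregates correctly, so this step is unnecessary there; whether it works around the issue on the affected old versions is an assumption:
import pandas as pd

x = pd.DataFrame({'x': [1, 1, 3, 3], 'y': [3, 3, 5, 5]}, index=[11, 11, 12, 12])
s = x.stack()
# Rebuild the index from its (level_0, level_1) tuples to drop duplicate level values.
s.index = pd.MultiIndex.from_tuples(s.index)
print(s.groupby(level=[0, 1]).sum())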
sum allows you to specify the levels to sum over when working with a MultiIndexed Series or DataFrame.
x = pd.DataFrame({'x':[1,1,3,3],'y':[3,3,5,5]},index=[11,11,12,12])
y = x.stack()
y.sum(level=[0,1])
11 x 2
y 6
12 x 6
y 10
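Note that in newer pandas versions the level argument to sum has been deprecated in favour of an explicit groupby, so the equivalent spelling would be:
y.groupby(level=[0, 1]).sum()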
Using pandas 0.15.2, you just need one more iteration of groupby:
x = pd.DataFrame({'x':[1,1,3,3],'y':[3,3,5,5]},index=[11,11,12,12])
y = x.stack().groupby(level=[0,1]).sum()
print(y.groupby(level=[0,1]).sum())
prints
11 x 2
y 6
12 x 6
y 10