I have a DataFrame like this:
L1 L2 L3 L4 L5
A 1 2 3 4 5
B 1 2 4 3 5
C 1 3 3 2 1
I want to calculate the number of differences between rows; for example, the number of differences between A and B is 2, between A and C is 3, and between B and C is 4.
What I really want is a difference matrix, such as
A B C
A 0 2 3
B 2 0 4
C 3 4 0
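For reference, a minimal sketch that builds this example frame:
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5],
                   [1, 2, 4, 3, 5],
                   [1, 3, 3, 2, 1]],
                  index=list('ABC'),
                  columns=['L1', 'L2', 'L3', 'L4', 'L5'])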
A first, loop-style solution is to apply over each row, compare it against the whole DataFrame with ne, and sum the mismatches per row:
df = df.apply(lambda x: df.ne(x).sum(axis=1), axis=1)
print (df)
A B C
A 0 2 3
B 2 0 4
C 3 4 0
Or, to improve performance, compare the values in numpy with broadcasting to a 3D array, sum over the last axis, and finally use the DataFrame constructor:
a = df.to_numpy()
out = pd.DataFrame((a != a[:, None]).sum(2), index=df.index, columns=df.index)
print (out)
A B C
A 0 2 3
B 2 0 4
C 3 4 0
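To see why this works: a[:, None] inserts a new axis, so the comparison broadcasts to a 3D boolean array of shape (n_rows, n_rows, n_cols), and summing over the last axis counts the differing columns for every pair of rows. A minimal sketch with the 3x5 frame above:
import numpy as np

a = df.to_numpy()            # shape (3, 5)
diff3d = a != a[:, None]     # shape (3, 3, 5): elementwise inequality per row pair
print(diff3d.shape)          # (3, 3, 5)
print(diff3d.sum(axis=2))    # the 3x3 matrix of pairwise difference counts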
Performance (on a 100x500 frame of random integers):
np.random.seed(123)
df = pd.DataFrame(np.random.randint(20, size=(100, 500)))
print (df)
In [119]: %%timeit
...: df.apply(lambda x: df.ne(x).sum(axis=1), axis=1)
...:
...:
12.8 s ± 1.02 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [120]: %%timeit
...: a = df.to_numpy()
...: pd.DataFrame((a != a[:, None]).sum(2), index=df.index, columns=df.index)
...:
...:
14.6 ms ± 325 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
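The broadcasting approach is roughly 875x faster here (12.8 s vs. 14.6 ms).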
I have the DataFrame df below and want to loop over it:
  name
0    a
1    b
2    c
3    d
I have tried the code below:
for index, row in df.iterrows():
    for line in df['name']:
        print(index, line)
but the result I want is a DataFrame like this:
name name1
a    a
a    b
a    c
a    d
b    a
b    b
b    c
b    d
etc.
Is there any way to do this? I know it's a basic question, but I'm new to Python.
One way uses pandas.DataFrame.explode: put a copy of the whole name column into every row, then explode it so each element gets its own row:
df["name1"] = [df["name"] for _ in df["name"]]
df.explode("name1")
Output:
name name1
0 a a
0 a b
0 a c
0 a d
1 b a
1 b b
1 b c
1 b d
2 c a
2 c b
2 c c
2 c d
3 d a
3 d b
3 d c
3 d d
The fastest solution uses numpy (thanks @Ch3steR):
df = pd.DataFrame({'name': np.repeat(df['name'], len(df)),
                   'name1': np.tile(df['name'], len(df))})
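The difference between the two helpers: np.repeat repeats each element consecutively, while np.tile repeats the whole array, which together produce every pairing. A quick illustration:
import numpy as np

print(np.repeat(['a', 'b'], 2))   # ['a' 'a' 'b' 'b'] -> left column of the pairs
print(np.tile(['a', 'b'], 2))     # ['a' 'b' 'a' 'b'] -> right column of the pairs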
Use itertools.product with the DataFrame constructor:
from itertools import product
df = pd.DataFrame(product(df['name'], df['name']), columns=['name','name1'])
# for older pandas versions:
# df = pd.DataFrame(list(product(df['name'], df['name'])), columns=['name','name1'])
print (df)
name name1
0 a a
1 a b
2 a c
3 a d
4 b a
5 b b
6 b c
7 b d
8 c a
9 c b
10 c c
11 c d
12 d a
13 d b
14 d c
15 d d
Another idea is to use a cross join with a helper column, which also performs well:
df1 = df.assign(new=1)
df = df1.merge(df1, on='new', suffixes=('','1')).drop('new', axis=1)
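On pandas 1.2 or newer (an assumption about your environment), merge supports a cross join directly, so the helper column is unnecessary:
# requires pandas >= 1.2
df = df.merge(df, how='cross', suffixes=('', '1'))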
Performance:
from itertools import product
df = pd.DataFrame({'name':range(1000)})
# print (df)
In [17]: %%timeit
...: df["name1"] = [df["name"] for _ in df["name"]]
...: df.explode("name1")
...:
...:
18.9 s ± 1.18 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [18]: %%timeit
...: pd.DataFrame(product(df['name'], df['name']), columns=['name','name1'])
...:
1.01 s ± 62.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [19]: %%timeit
...: df1 = df.assign(new=1)
...: df1.merge(df1, on='new', suffixes=('','1')).drop('new', axis=1)
...:
...:
245 ms ± 21.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [20]: %%timeit
...: pd.DataFrame({'name':np.repeat(df['name'],len(df)), 'name1':np.tile(df['name'],len(df))})
...:
30.2 ms ± 1.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
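Here the numpy repeat/tile approach is roughly 600x faster than explode (30.2 ms vs. 18.9 s), with the cross join and itertools.product in between.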
I am using Pandas to try to find all those Y elements that precede the corresponding X elements in time.
df = {'time':[1,2,3,4,5,6,7,8], 'X':['x','w','r','a','k','y','u','xa'],'Y':['r','xa','a','x','w','u','k','y']}
df = pd.DataFrame.from_dict(df)
time X Y
0 1 x r
1 2 w xa
2 3 r a
3 4 a x
4 5 k w
5 6 y u
6 7 u k
7 8 xa y
What I would like to achieve is:
time X Y
0 1 x r
1 2 w xa
2 3 r a
5 6 y u
Any ideas?
You can build two dictionaries that map each value to its time. Then use pd.Series.map to build a boolean mask, and select rows with boolean indexing:
idx = dict(zip(df['X'],df['time']))
idx2 = dict(zip(df['Y'],df['time']))
mask = df['Y'].map(lambda k: idx[k]>idx2[k])
df[mask]
time X Y
0 1 x r
1 2 w xa
2 3 r a
5 6 y u
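A vectorized variant of the same idea maps each Y value to the time it appears in X and keeps the rows where that time is later than the row's own time (a sketch, assuming each value occurs at most once in X):
x_time = pd.Series(df['time'].values, index=df['X'])  # value in X -> time it appears
mask = df['Y'].map(x_time) > df['time']               # X occurrence comes after Y's row
print(df[mask])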
df.apply over axis=1 is not recommended; it loops row by row in Python and should be your last resort. Here's a timeit analysis that supports this:
In [74]: %%timeit
...: df[df.apply(lambda row: row['Y'] in df.loc[row.time:,'X'].values, axis=1)]
...:
...:
2.26 ms ± 203 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [80]: %%timeit
...: idx = dict(zip(df['X'],df['time']))
...: idx2 = dict(zip(df['Y'],df['time']))
...: mask = df['Y'].map(lambda k: idx[k]>idx2[k])
...: x = df[mask]
...:
...:
498 µs ± 30.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Almost 5X faster.
Try this:
result = df[df.apply(lambda row: row['Y'] in df.loc[row.time:,'X'].values, axis=1)]
print(result)
time X Y
0 1 x r
1 2 w xa
2 3 r a
5 6 y u
Given a pandas Series with an index:
import pandas as pd
s = pd.Series(data=[1,2,3],index=['a','b','c'])
How can a Series be used to fill the diagonal entries of an empty DataFrame in pandas version >= 0.23.0?
The resulting DataFrame would look like:
a b c
a 1 0 0
b 0 2 0
c 0 0 3
There is a similar prior question about filling the diagonal with a single value; my question asks how to fill the diagonal with varying values from a Series.
Thank you in advance for your consideration and response.
First create the DataFrame, then use numpy.fill_diagonal:
import numpy as np
s = pd.Series(data=[1,2,3],index=['a','b','c'])
df = pd.DataFrame(0, index=s.index, columns=s.index, dtype=s.dtype)
np.fill_diagonal(df.values, s)
print (df)
a b c
a 1 0 0
b 0 2 0
c 0 0 3
Another solution is to create an empty 2D array, fill its diagonal, and finally use the DataFrame constructor:
arr = np.zeros((len(s), len(s)), dtype=s.dtype)
np.fill_diagonal(arr, s)
print (arr)
[[1 0 0]
[0 2 0]
[0 0 3]]
df = pd.DataFrame(arr, index=s.index, columns=s.index)
print (df)
a b c
a 1 0 0
b 0 2 0
c 0 0 3
I'm not sure about doing it directly in pandas, but it is easy enough if you don't mind using numpy.diag() to build the diagonal matrix for your Series and then plugging that into a DataFrame:
diag_data = np.diag(s) # don't need s.as_matrix(), turns out
df = pd.DataFrame(diag_data, index=s.index, columns=s.index)
a b c
a 1 0 0
b 0 2 0
c 0 0 3
In one line:
df = pd.DataFrame(np.diag(s),
index=s.index,
columns=s.index)
Timing comparison with a Series made from a random array of 10000 elements:
s = pd.Series(np.random.rand(10000), index=np.arange(10000))
df = pd.DataFrame(np.diag(s), ...)
173 ms ± 2.91 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)
df = pd.DataFrame(0, ...)
np.fill_diagonal(df.values, s)
212 ms ± 909 µs per loop (mean ± std. dev. of 7 runs, 20 loops each)
mat = np.zeros(...)
np.fill_diagonal(mat, s)
df = pd.DataFrame(mat, ...)
175 ms ± 3.72 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)
It looks like the first and third options shown here are essentially the same, while the middle option is the slowest.
Here's my data
Id Amount
1 6
2 2
3 0
4 6
What I need is a mapping: if Amount is 3 or more, Map is 1; otherwise, Map is 0.
Id Amount Map
1 6 1
2 2 0
3 0 0
4 6 1
What I did
a = df[['Id','Amount']]
a = a[a['Amount'] >= 3]
a['Map'] = 1
a = a[['Id', 'Map']]
df= df.merge(a, on='Id', how='left')
df['Map'] = df['Map'].fillna(0)
It works, but it is clumsy and not efficient.
Convert boolean mask to integer:
# for better performance, convert to a numpy array
df['Map'] = (df['Amount'].values >= 3).astype(int)
# pure pandas solution
df['Map'] = (df['Amount'] >= 3).astype(int)
print (df)
Id Amount Map
0 1 6 1
1 2 2 0
2 3 0 0
3 4 6 1
Performance:
#[400000 rows x 3 columns]
df = pd.concat([df] * 100000, ignore_index=True)
In [133]: %timeit df['Map'] = (df['Amount'].values >= 3).astype(int)
2.44 ms ± 97.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [134]: %timeit df['Map'] = (df['Amount'] >= 3).astype(int)
2.6 ms ± 66.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
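If the cutoff should be configurable, np.where reads clearly and is comparably fast (a sketch; the threshold name is illustrative, not from the question):
threshold = 3  # illustrative parameter
df['Map'] = np.where(df['Amount'] >= threshold, 1, 0)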
Given a dataframe
A B C
3 1 2
2 1 3
3 2 1
I would like to get a new column holding the column names ordered by each row's values:
A B C new_col
3 1 2 [B,C,A]
2 1 3 [B,A,C]
3 2 1 [C,B,A]
This is my code. It works but is quite slow.
import operator

col_list = df.columns

def blist(x):
    col_dict = {}
    for col in col_list:
        col_dict[col] = x[col]
    sorted_tuple = sorted(col_dict.items(), key=operator.itemgetter(1))
    return [i[0] for i in sorted_tuple]

df['new_col'] = df.apply(blist, axis=1)
I would appreciate a better approach to this problem.
Try to use np.argsort() in conjunction with np.take():
In [132]: df['new_col'] = np.take(df.columns, np.argsort(df)).tolist()
In [133]: df
Out[133]:
A B C new_col
0 3 1 2 [B, C, A]
1 2 1 3 [B, A, C]
2 3 2 1 [C, B, A]
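The same result can be written with plain NumPy indexing, which makes the mechanics explicit: np.argsort returns, for each row, the column positions in ascending value order, and indexing the array of column labels with those positions maps them back to names (a sketch):
cols = df[['A', 'B', 'C']]                               # the value columns
order = np.argsort(cols.to_numpy(), axis=1)              # e.g. first row -> [1, 2, 0]
df['new_col'] = cols.columns.to_numpy()[order].tolist()  # positions -> labels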
Timing for a 30,000-row DataFrame:
In [182]: df = pd.concat([df] * 10**4, ignore_index=True)
In [183]: df.shape
Out[183]: (30000, 3)
In [184]: %timeit df.apply(blist,axis=1)
4.84 s ± 31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [185]: %timeit np.take(df.columns, np.argsort(df)).tolist()
5.45 ms ± 26.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Ratio:
In [187]: (4.84*1000)/5.45
Out[187]: 888.0733944954128