I have the below DataFrame and want to loop over it:

df =
  name
0    a
1    b
2    c
3    d
I have tried the below code:

for index, row in df.iterrows():
    for line in df['name']:
        print(index, line)
but the result I want is a DataFrame as below:

df =
  name name1
     a     a
     a     b
     a     c
     a     d
     b     a
     b     b
     b     c
     b     d
etc.
Is there any possible way to do it? I know it's a stupid question, but I'm new to Python.
One way using pandas.DataFrame.explode:
df["name1"] = [df["name"] for _ in df["name"]]
df.explode("name1")
Output:
name name1
0 a a
0 a b
0 a c
0 a d
1 b a
1 b b
1 b c
1 b d
2 c a
2 c b
2 c c
2 c d
3 d a
3 d b
3 d c
3 d d
Fastest solution in NumPy, thank you @Ch3steR:

import numpy as np

# repeat each name len(df) times, and tile the whole column len(df) times
df = pd.DataFrame({'name': np.repeat(df['name'], len(df)),
                   'name1': np.tile(df['name'], len(df))})
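For intuition, a minimal sketch of what repeat and tile produce on a small array (the value comments are illustrative, not output from the original answer):

a = np.array(['a', 'b'])
np.repeat(a, 2)  # ['a', 'a', 'b', 'b'] - each element repeated in place
np.tile(a, 2)    # ['a', 'b', 'a', 'b'] - the whole array repeated end to end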
Use itertools.product with the DataFrame constructor:
from itertools import product
df = pd.DataFrame(product(df['name'], df['name']), columns=['name','name1'])
# older pandas versions
# df = pd.DataFrame(list(product(df['name'], df['name'])), columns=['name','name1'])
print (df)
name name1
0 a a
1 a b
2 a c
3 a d
4 b a
5 b b
6 b c
7 b d
8 c a
9 c b
10 c c
11 c d
12 d a
13 d b
14 d c
15 d d
Another idea is to use a cross join, the best solution if performance is important:
df1 = df.assign(new=1)
df = df1.merge(df1, on='new', suffixes=('','1')).drop('new', axis=1)
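As a side note, on pandas 1.2+ the helper column is unnecessary, because merge supports a native cross join; a minimal sketch, not part of the original answer:

df = df.merge(df, how='cross', suffixes=('', '1'))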
Performance:
from itertools import product
df = pd.DataFrame({'name':range(1000)})
# print (df)
In [17]: %%timeit
...: df["name1"] = [df["name"] for _ in df["name"]]
...: df.explode("name1")
...:
...:
18.9 s ± 1.18 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [18]: %%timeit
...: pd.DataFrame(product(df['name'], df['name']), columns=['name','name1'])
...:
1.01 s ± 62.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [19]: %%timeit
...: df1 = df.assign(new=1)
...: df1.merge(df1, on='new', suffixes=('','1')).drop('new', axis=1)
...:
...:
245 ms ± 21.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [20]: %%timeit
...: pd.DataFrame({'name':np.repeat(df['name'],len(df)), 'name1':np.tile(df['name'],len(df))})
...:
30.2 ms ± 1.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Related
I have a DataFrame like this:
L1 L2 L3 L4 L5
A 1 2 3 4 5
B 1 2 4 3 5
C 1 3 3 2 1
I want to calculate the number of differences between rows; for example, the number of differences between A and B is 2, between A and C is 3, and between B and C is 4.
What I really want is a difference matrix, such as
A B C
A 0 2 3
B 2 0 4
C 3 4 0
The first, loop-based solution is to iterate over each row, compare it against the whole DataFrame with ne, and sum:
df = df.apply(lambda x: df.ne(x).sum(axis=1), axis=1)
print (df)
A B C
A 0 2 3
B 2 0 4
C 3 4 0
Or, to improve performance, compare the values in NumPy with broadcasting over a 3D array, sum, and finally use the DataFrame constructor:
a = df.to_numpy()
out = pd.DataFrame((a != a[:, None]).sum(2), index=df.index, columns=df.index)
print (out)
A B C
A 0 2 3
B 2 0 4
C 3 4 0
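To make the broadcasting step concrete, here is a sketch of the intermediate shapes for this 3x5 example (the shape comments are derived, not output from the original answer):

a = df.to_numpy()         # shape (3, 5): 3 rows, 5 columns
a[:, None]                # shape (3, 1, 5): rows lifted into a new axis
a != a[:, None]           # shape (3, 3, 5): every row compared with every row
(a != a[:, None]).sum(2)  # shape (3, 3): count of differing columns per pair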
np.random.seed(123)
df = pd.DataFrame( np.random.randint(20, size=(100, 500)))
print (df)
In [119]: %%timeit
...: df.apply(lambda x: df.ne(x).sum(axis=1), axis=1)
...:
...:
12.8 s ± 1.02 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [120]: %%timeit
...: a = df.to_numpy()
...: pd.DataFrame((a != a[:, None]).sum(2), index=df.index, columns=df.index)
...:
...:
14.6 ms ± 325 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
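One caveat on this design choice: the broadcasting approach materializes an (n, n, m) boolean intermediate, so memory grows quadratically with the number of rows; for very large frames the slower apply solution may be the only one that fits in memory.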
I have a DataFrame and a list:

df = pd.DataFrame({'id': [1,2,3,4,5,6,7,8],
                   'char': [['a','b'],['a','b','c'],['a','c'],['b','c'],[],['c','a','d'],['c','d'],['a']]})
names = ['a','c']
I want to get rows only if both a and c are present in the char column (order doesn't matter here).
Expected Output:
char id
1 [a, b, c] 2
2 [a, c] 3
5 [c, a, d] 6
My Efforts

true_indices = []
for idx, row in df.iterrows():
    if all(name in row['char'] for name in names):
        true_indices.append(idx)

ids = df[df.index.isin(true_indices)]

This gives the correct output, but it is too slow for a large dataset, so I am looking for a more efficient solution.
Use pandas.Series.apply:
df[df['char'].apply(lambda x: set(names).issubset(x))]
Output:
id char
1 2 [a, b, c]
2 3 [a, c]
5 6 [c, a, d]
You can build a set from the list of names for a faster lookup, and use set.issubset to check if all elements in the set are contained in the column lists:
names = set(['a','c'])
df[df['char'].map(names.issubset)]
id char
1 2 [a, b, c]
2 3 [a, c]
5 6 [c, a, d]
Use a list comprehension with issubset:
mask = [set(names).issubset(x) for x in df['char']]
df = df[mask]
print (df)
id char
1 2 [a, b, c]
2 3 [a, c]
5 6 [c, a, d]
Another solution with Series.map:
df = df[df['char'].map(set(names).issubset)]
print (df)
id char
1 2 [a, b, c]
2 3 [a, c]
5 6 [c, a, d]
Performance depends on the number of rows and the number of matched values:
df = pd.concat([df] * 10000, ignore_index=True)
In [270]: %timeit df[df['char'].apply(lambda x: set(names).issubset(x))]
45.9 ms ± 2.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [271]: %%timeit
...: names = set(['a','c'])
...: [names.issubset(set(row)) for _,row in df.char.iteritems()]
...:
46.7 ms ± 5.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [272]: %%timeit
...: df[[set(names).issubset(x) for x in df['char']]]
...:
45.6 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [273]: %%timeit
...: df[df['char'].map(set(names).issubset)]
...:
18.3 ms ± 2.96 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [274]: %%timeit
...: n = set(names)
...: df[df['char'].map(n.issubset)]
...:
16.6 ms ± 278 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [279]: %%timeit
...: names = set(['a','c'])
...: m = [names.issubset(i) for i in df.char.values.tolist()]
...:
19.2 ms ± 317 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Try this:

df['char'] = df['char'].apply(lambda x: x if ('a' in x and 'c' in x) else np.nan)
print(df.dropna())

Output:
id char
1 2 [a, b, c]
2 3 [a, c]
5 6 [c, a, d]
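A caveat: this solution overwrites the char column, and dropna removes rows containing NaN in any column, so the mask-based solutions above are safer when the DataFrame has other columns with missing values.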
Given the following dataframe df:
df = pd.DataFrame({'A':['Tony', 'Mike', 'Jen', 'Anna'], 'B': ['no', 'yes', 'no', 'yes']})
A B
0 Tony no
1 Mike yes
2 Jen no
3 Anna yes
I want to add another column that progressively counts the rows where df['B'] == 'yes':
A B C
0 Tony no 0
1 Mike yes 1
2 Jen no 0
3 Anna yes 2
How can I do this?
You can use numpy.where with the cumsum of a boolean mask:
m = df['B']=='yes'
df['C'] = np.where(m, m.cumsum(), 0)
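For intuition, the intermediates on this example (the value comments are derived, not output from the original answer):

m = df['B'] == 'yes'        # [False, True, False, True]
m.cumsum()                  # [0, 1, 1, 2] - running count of 'yes'
np.where(m, m.cumsum(), 0)  # [0, 1, 0, 2] - keep the count only on 'yes' rows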
Another solution is to take the cumulative sum of the boolean mask filtered to its True values, and then restore the 0 rows with reindex:
m = df['B']=='yes'
df['C'] = m[m].cumsum().reindex(df.index, fill_value=0)
print (df)
A B C
0 Tony no 0
1 Mike yes 1
2 Jen no 0
3 Anna yes 2
Performance (results on real data may differ; best to check it first):
np.random.seed(123)
N = 10000
L = ['yes','no']
df = pd.DataFrame({'B': np.random.choice(L, N)})
print (df)
In [150]: %%timeit
...: m = df['B']=='yes'
...: df['C'] = np.where(m, m.cumsum(), 0)
...:
1.57 ms ± 34.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [151]: %%timeit
...: m = df['B']=='yes'
...: df['C'] = m[m].cumsum().reindex(df.index, fill_value=0)
...:
2.53 ms ± 54.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [152]: %%timeit
...: df['C'] = df.groupby('B').cumcount() + 1
...: df['C'].where(df['B'] == 'yes', 0, inplace=True)
4.49 ms ± 27.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
You can use GroupBy + cumcount followed by pd.Series.where:
df['C'] = df.groupby('B').cumcount() + 1
df['C'].where(df['B'] == 'yes', 0, inplace=True)
print(df)
A B C
0 Tony no 0
1 Mike yes 1
2 Jen no 0
3 Anna yes 2
Here's my data:
Id Amount
1 6
2 2
3 0
4 6
What I need is to map: if Amount is 3 or more, Map is 1; if Amount is less than 3, Map is 0.
Id Amount Map
1 6 1
2 2 0
3 0 0
4 6 1
What I did:
a = df[['Id','Amount']]
a = a[a['Amount'] >= 3]
a['Map'] = 1
a = a[['Id', 'Map']]
df= df.merge(a, on='Id', how='left')
df['Amount'].fillna(0)
It works, but it is not very configurable and not efficient.
Convert the boolean mask to integer:

# for better performance, convert to a numpy array
df['Map'] = (df['Amount'].values >= 3).astype(int)

# pure pandas solution
df['Map'] = (df['Amount'] >= 3).astype(int)
print (df)
Id Amount Map
0 1 6 1
1 2 2 0
2 3 0 0
3 4 6 1
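The same mapping can also be written with numpy.where; a minimal alternative sketch, not part of the original answer:

import numpy as np

df['Map'] = np.where(df['Amount'] >= 3, 1, 0)  # 1 where the condition holds, else 0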
Performance:
# [400000 rows x 2 columns]
df = pd.concat([df] * 100000, ignore_index=True)
In [133]: %timeit df['Map'] = (df['Amount'].values >= 3).astype(int)
2.44 ms ± 97.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [134]: %timeit df['Map'] = (df['Amount'] >= 3).astype(int)
2.6 ms ± 66.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Given a DataFrame:
A B C
3 1 2
2 1 3
3 2 1
I would like to get a new column containing the column names in ascending order of each row's values:
A B C new_col
3 1 2 [B,C,A]
2 1 3 [B,A,C]
3 2 1 [C,B,A]
This is my code. It works but is quite slow.
import operator

col_list = df.columns  # assumption: col_list was not defined in the original snippet

def blist(x):
    col_dict = {}
    for col in col_list:
        col_dict[col] = x[col]
    sorted_tuple = sorted(col_dict.items(), key=operator.itemgetter(1))
    return [i[0] for i in sorted_tuple]

df['new_col'] = df.apply(blist, axis=1)

I would appreciate a better approach to solving this problem.
Try to use np.argsort() in conjunction with np.take():
In [132]: df['new_col'] = np.take(df.columns, np.argsort(df)).tolist()
In [133]: df
Out[133]:
A B C new_col
0 3 1 2 [B, C, A]
1 2 1 3 [B, A, C]
2 3 2 1 [C, B, A]
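To see why this works, a sketch of the intermediates (the value comments are derived for this example, not output from the original answer):

np.argsort(df)
# array([[1, 2, 0],
#        [1, 0, 2],
#        [2, 1, 0]]) - column positions, row-wise, in ascending order of value
np.take(df.columns, np.argsort(df))
# maps those positions back to column labels, e.g. [1, 2, 0] -> [B, C, A]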
Timing for a 30,000-row DataFrame:
In [182]: df = pd.concat([df] * 10**4, ignore_index=True)
In [183]: df.shape
Out[183]: (30000, 3)
In [184]: %timeit df.apply(blist,axis=1)
4.84 s ± 31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [185]: %timeit np.take(df.columns, np.argsort(df)).tolist()
5.45 ms ± 26.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Ratio:
In [187]: (4.84*1000)/5.45
Out[187]: 888.0733944954128