How to map a new variable in pandas in an effective way - python

Here's my data
Id Amount
1 6
2 2
3 0
4 6
What I need is to map: if Amount is 3 or more, Map is 1; if Amount is less than 3, Map is 0.
Id Amount Map
1 6 1
2 2 0
3 0 0
4 6 1
What I did
a = df[['Id','Amount']]
a = a[a['Amount'] >= 3]
a['Map'] = 1
a = a[['Id', 'Map']]
df= df.merge(a, on='Id', how='left')
df['Map'] = df['Map'].fillna(0)
It works, but it is not very configurable and not efficient.

Convert boolean mask to integer:
#for better performance convert to numpy array
df['Map'] = (df['Amount'].values >= 3).astype(int)
#pure pandas solution
df['Map'] = (df['Amount'] >= 3).astype(int)
print (df)
Id Amount Map
0 1 6 1
1 2 2 0
2 3 0 0
3 4 6 1
Performance:
#[400000 rows x 3 columns]
df = pd.concat([df] * 100000, ignore_index=True)
In [133]: %timeit df['Map'] = (df['Amount'].values >= 3).astype(int)
2.44 ms ± 97.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [134]: %timeit df['Map'] = (df['Amount'] >= 3).astype(int)
2.6 ms ± 66.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
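If more categories are needed later (the "configurable" part of the question), np.select generalizes the same idea; the threshold and labels below are illustrative assumptions, not part of the original answer:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Id': [1, 2, 3, 4], 'Amount': [6, 2, 0, 6]})

# each condition gets a label; rows matching no condition get the default
conditions = [df['Amount'] >= 3]
labels = [1]
df['Map'] = np.select(conditions, labels, default=0)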

Related

fastest way to change multiple loc in a dataframe

I have a pandas dataframe with 1 million rows. I want to replace values in 900,000 rows in a column with another set of values. Is there a fast way to do this without a for loop (which takes me two days to complete)?
For example, look at this sample dataframe where I have condensed 1 million rows to 8 rows
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['a'] = [-1,-3,-4,-4,-3, 4,5,6]
df['b'] = [23,45,67,89,0,-1, 2, 3]
L2 = [-1,-3,-4]
L5 = [9,10,11]
I want to replace the values where a is -1, -3, or -4 in a single shot if possible, or as fast as possible without a for loop.
The crucial part is that values in L5 have to be repeated as needed.
I have tried
df.loc[df.a < 0, 'a'] = L5
but this works only when the number of selected rows equals len(L5)
Use map with a dictionary created from both lists with zip, then restore the original values for non-matched entries with fillna:
d = dict(zip(L2, L5))
print (d)
{-1: 9, -3: 10, -4: 11}
df['a'] = df['a'].map(d).fillna(df['a'])
print (df)
a b
0 9.0 23
1 10.0 45
2 11.0 67
3 11.0 89
4 10.0 0
5 4.0 -1
6 5.0 2
7 6.0 3
Performance:
It depends on the number of values to replace and on the length of the lists:
Length of lists is 100:
np.random.seed(123)
N = 1000000
df = pd.DataFrame({'a':np.random.randint(1000, size=N)})
L2 = np.arange(100)
L5 = np.arange(100) + 10
In [336]: %timeit df['d'] = np.select([df['a'] == i for i in L2], L5, df['a'])
180 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [337]: %timeit df['a'].map(dict(zip(L2, L5))).fillna(df['a'])
56.9 ms ± 2.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
If the length of the lists is small (e.g. 3), np.select is faster, because it builds one boolean mask per list element, so its cost grows with list length while map's cost stays roughly constant:
np.random.seed(123)
N = 1000000
df = pd.DataFrame({'a':np.random.randint(100, size=N)})
L2 = np.arange(3)
L5 = np.arange(3) + 10
In [339]: %timeit df['d'] = np.select([df['a'] == i for i in L2], L5, df['a'])
11.9 ms ± 40.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [340]: %timeit df['a'].map(dict(zip(L2, L5))).fillna(df['a'])
54 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
You can use np.select, for example:
import numpy as np
condition = [df['a'] == i for i in L2]
df['a'] = np.select(condition, L5, df['a'])
and you get:
a b
0 9 23
1 10 45
2 11 67
3 11 89
4 10 0
5 4 -1
6 5 2
7 6 3
Timing: let's create a bigger dataframe from your df:
df_l = pd.concat([df]*10000)
print (df_l.shape)
(80000, 2)
Now some timeit:
# with map, #jezrael
d = dict(zip(L2, L5))
%timeit df_l['a'].map(d).fillna(df_l['a'])
100 loops, best of 3: 7.71 ms per loop
# with np.select
condition = [df_l['a'] == i for i in L2]
%timeit np.select(condition, L5, df_l['a'])
1000 loops, best of 3: 350 µs per loop

Setting highest value in row to 1 and rest to 0 in pandas

My original dataframe looks like this:
A B C
0.10 0.83 0.07
0.40 0.30 0.30
0.70 0.17 0.13
0.72 0.04 0.24
0.15 0.07 0.78
I would like each row to become binarized: 1 would be assigned to the column with the highest value and the rest would be set to 0, so the previous dataframe would become:
A B C
0 1 0
1 0 0
1 0 0
1 0 0
0 0 1
How can this be done?
Thanks.
EDIT: I understand that a specific case made my question ambiguous. I should've said that if all 3 columns are equal for a given row, I'd still want to get a [1 0 0] vector and not [1 1 1] for that row.
Using numpy with argmax
m = np.zeros_like(df.values)
m[np.arange(len(df)), df.values.argmax(1)] = 1
df1 = pd.DataFrame(m, columns = df.columns).astype(int)
# Result
A B C
0 0 1 0
1 1 0 0
2 1 0 0
3 1 0 0
4 0 0 1
Timings
df_test = pd.concat([df] * 1000)
def chris_z(df):
    m = np.zeros_like(df.values)
    m[np.arange(len(df)), df.values.argmax(1)] = 1
    return pd.DataFrame(m, columns=df.columns).astype(int)

def haleemur(df):
    return df.apply(lambda x: x == x.max(), axis=1).astype(int)

def haleemur_2(df):
    return pd.DataFrame((df.T == df.T.max()).T.astype(int), columns=df.columns)

def sacul(df):
    return pd.DataFrame(np.where(df.T == df.T.max(), 1, 0), index=df.columns).T
Results
In [320]: %timeit chris_z(df_test)
358 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [321]: %timeit haleemur(df_test)
1.14 s ± 45.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [329]: %timeit haleemur_2(df_test)
972 µs ± 11.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [333]: %timeit sacul(df_test)
1.01 ms ± 3.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
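For reference, a pandas-only variant that also keeps a single 1 per row on ties (idxmax returns the first column holding the row maximum, which matches the EDIT) could look like this sketch; it is not included in the timings above:
import pandas as pd

df = pd.DataFrame({'A': [0.10, 0.40, 0.70, 0.72, 0.15],
                   'B': [0.83, 0.30, 0.17, 0.04, 0.07],
                   'C': [0.07, 0.30, 0.13, 0.24, 0.78]})

# one-hot encode the name of each row's maximum column
df1 = (pd.get_dummies(df.idxmax(axis=1))
         .reindex(columns=df.columns, fill_value=0)
         .astype(int))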
df.apply(lambda x: x == x.max(), axis=1).astype(int)
should do it. This works by checking whether each value is the maximum of its row, and then casting to integer (True -> 1, False -> 0).
Instead of applying a lambda row-wise, it is also possible to transpose the dataframe, compare to the max, and then transpose back:
(df.T == df.T.max()).T.astype(int)
And lastly, a very fast numpy based solution:
pd.DataFrame((df.T.values == np.amax(df.values, 1)).T*1, columns = df.columns)
The output is in all cases:
A B C
0 0 1 0
1 1 0 0
2 1 0 0
3 1 0 0
4 0 0 1
Another numpy method, using np.where:
import numpy as np
new_df = pd.DataFrame(np.where(df.T == df.T.max(), 1, 0),index=df.columns).T
A B C
0 0 1 0
1 1 0 0
2 1 0 0
3 1 0 0
4 0 0 1

Selecting row based on a column variation

Suppose we have a file, named any_csv.csv, containing...
A,B,random
1,2,300
3,4,300
5,6,300
1,2,300
3,4,350
8,9,350
4,5,350
5,6,320
7,8,300
3,3,300
I wish to keep all the rows where random varies/changes.
I made this little program to achieve this, but, as I wish to learn more about pandas and as my program is slower than I expected (~130 seconds to process a 1.2-million-line log file), I ask for your help.
import pandas as pd
import numpy as np
df = pd.read_csv('any_csv.csv')
mask = np.zeros(len(df.index), dtype=bool)
# Initializing my current value for comparison purposes.
mask[0] = 1
previous_val = df.iloc[0]['random']
for index, row in df.iterrows():
    if row['random'] != previous_val:
        # If a variation has been detected, set the current and previous indexes to True.
        previous_val = row['random']
        mask[index] = 1
        mask[index - 1] = 1
# Keeping the last item.
mask[-1] = 1
df = df.loc[mask]
df.to_csv('any_other_csv.csv', index=False)
In short, I guess I wish to know how to apply the if from this homemade for-loop in a vectorized way, since the loop is overall pretty slow.
Results:
A,B,random
1,2,300
1,2,300
3,4,350
4,5,350
5,6,320
7,8,300
3,3,300
You can utilize pd.Series.shift to create a mask of Boolean values. The Boolean mask indicates when a value is different from the value above or below it within the series.
You can then apply the Boolean mask to your dataframe directly.
mask = (df['random'] != df['random'].shift()) | \
       (df['random'] != df['random'].shift(-1))
df = df[mask]
print(df)
A B random
0 1 2 300
3 1 2 300
4 3 4 350
6 4 5 350
7 5 6 320
8 7 8 300
9 3 3 300
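The first and last rows are always kept because shift() and shift(-1) introduce NaN at the edges, and any value compares as not equal to NaN. A quick look at the shifted series (a small illustrative sketch):
import pandas as pd

s = pd.Series([300, 300, 350, 320, 300])
print(s.shift())    # NaN  300  300  350  320  -> the first comparison is always True
print(s.shift(-1))  # 300  350  320  300  NaN  -> the last comparison is always True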
Use boolean indexing with 2 masks that check for different values, using shift and ne (not equal):
df = df[df['random'].ne(df['random'].shift()) | df['random'].ne(df['random'].shift(-1))]
print (df)
A B random
0 1 2 300
3 1 2 300
4 3 4 350
6 4 5 350
7 5 6 320
8 7 8 300
9 3 3 300
For easier verification:
df['mask1'] = df['random'].ne(df['random'].shift())
df['mask2'] = df['random'].ne(df['random'].shift(-1))
df['mask3'] = df['random'].ne(df['random'].shift()) | df['random'].ne(df['random'].shift(-1))
print (df)
A B random mask1 mask2 mask3
0 1 2 300 True False True
1 3 4 300 False False False
2 5 6 300 False False False
3 1 2 300 False True True
4 3 4 350 True False True
5 8 9 350 False False False
6 4 5 350 False True True
7 5 6 320 True True True
8 7 8 300 True False True
9 3 3 300 False True True
Timings:
N = 1000
In [157]: %timeit orig(df)
56.8 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [158]: %timeit (df[df['random'].ne(df['random'].shift()) |
                      df['random'].ne(df['random'].shift(-1))])
939 µs ± 7.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
#jpp solution - a bit slower
In [159]: %timeit df[(df['random'] != df['random'].shift()) | (df['random'] != df['random'].shift(-1))]
1.11 ms ± 8.71 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
N = 10000
In [160]: %timeit orig(df)
538 ms ± 3.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [161]: %timeit (df[df['random'].ne(df['random'].shift()) | df['random'].ne(df['random'].shift(-1))])
1.16 ms ± 75 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
#jpp solution - a bit slower
In [162]: %timeit df[(df['random'] != df['random'].shift()) | (df['random'] != df['random'].shift(-1))]
1.28 ms ± 8.51 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
np.random.seed(123)
N = 1000
df = pd.DataFrame({'random':np.random.randint(2, size=N)})
print (df)
def orig(df):
    mask = np.zeros(len(df.index), dtype=bool)
    # Initializing my current value for comparison purposes.
    mask[0] = 1
    previous_val = df.iloc[0]['random']
    for index, row in df.iterrows():
        if row['random'] != previous_val:
            # If a variation has been detected, set the current and previous indexes to True.
            previous_val = row['random']
            mask[index] = 1
            mask[index - 1] = 1
    # Keeping the last item.
    mask[-1] = 1
    return df.loc[mask]
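For very large frames, the same mask can also be built directly in numpy with diff (a sketch, assuming the random column is numeric):
import numpy as np

vals = df['random'].values
change = np.diff(vals) != 0                       # True where a row differs from the next row
mask = np.r_[True, change] | np.r_[change, True]  # keep the first/last rows and both sides of a change
out = df[mask]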
You could try something like below:
df.groupby(["A", "random"]).filter(lambda df: df.shape[0] == 1)

Efficient and fastest way in Pandas to create sorted list from column values

Given a dataframe
A B C
3 1 2
2 1 3
3 2 1
I would like to get a new column with the column names sorted by each row's values
A B C new_col
3 1 2 [B,C,A]
2 1 3 [B,A,C]
3 2 1 [C,B,A]
This is my code. It works but is quite slow.
import operator

col_list = df.columns  # the columns whose values are sorted

def blist(x):
    col_dict = {}
    for col in col_list:
        col_dict[col] = x[col]
    sorted_tuple = sorted(col_dict.items(), key=operator.itemgetter(1))
    return [i[0] for i in sorted_tuple]

df['new_col'] = df.apply(blist, axis=1)
I would appreciate a better approach to this problem.
Try to use np.argsort() in conjunction with np.take():
In [132]: df['new_col'] = np.take(df.columns, np.argsort(df)).tolist()
In [133]: df
Out[133]:
A B C new_col
0 3 1 2 [B, C, A]
1 2 1 3 [B, A, C]
2 3 2 1 [C, B, A]
Timing for a 30,000-row DF:
In [182]: df = pd.concat([df] * 10**4, ignore_index=True)
In [183]: df.shape
Out[183]: (30000, 3)
In [184]: %timeit df.apply(blist,axis=1)
4.84 s ± 31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [185]: %timeit np.take(df.columns, np.argsort(df)).tolist()
5.45 ms ± 26.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Ratio:
In [187]: (4.84*1000)/5.45
Out[187]: 888.0733944954128
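If passing a whole DataFrame of indices to np.take is not supported on a given pandas/numpy version, an equivalent plain-numpy formulation (a sketch rebuilding the small example) is:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [3, 2, 3], 'B': [1, 1, 2], 'C': [2, 3, 1]})

# fancy-index the array of column names with the per-row argsort of the values
order = np.argsort(df.values, axis=1)             # shape (n_rows, n_cols)
df['new_col'] = df.columns.values[order].tolist()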

Pandas Convert Dataframe Values to Labels

I am trying to use the values within the current dataframe as the "Index" and the dataframe's Index as the "Labels". For example:
Value1 Value2
0 0 1
1 2 4
2 NaN 3
This would result in
Labels
0 0
1 0
2 1
3 2
4 1
Currently I manage to do this using a loop that checks and applies the necessary labels/values, but with millions of labels to mark, this process becomes extremely time-consuming. Is there a way to do this in a smarter and quicker way? Thanks in advance.
Use stack with the DataFrame constructor (stack drops the NaN, and level 0 of the resulting MultiIndex carries the original row label for each remaining value):
s = df.stack()
df = pd.DataFrame(s.index.get_level_values(0).values,
                  columns=['Labels'],
                  index=s.values.astype(int)).sort_index()
print (df)
Labels
0 0
1 0
2 1
3 2
4 1
Detail:
print (s)
0 Value1 0.0
Value2 1.0
1 Value1 2.0
Value2 4.0
2 Value2 3.0
dtype: float64
Came up with a really good one (thanks to the collective effort of the pandas community). This one should be fast.
It uses the power and flexibility of repeat and ravel to flatten your data.
s = pd.Series(df.index.repeat(2), index=df.values.ravel())
s[s.index.notnull()].sort_index()
0.0 0
1.0 0
2.0 1
3.0 2
4.0 1
dtype: int64
A subsequent conversion of the result's index gives integers:
s = s[s.index.notnull()].sort_index()
s.index = s.index.astype(int)
A similar (slightly faster depending on your data) solution, which also results in an integer index, is to perform the filtering before converting to a Series:
v = df.index.repeat(df.shape[1])
i = df.values.ravel()
m = ~np.isnan(i)
s = pd.Series(v[m], index=i[m].astype(int)).sort_index()
s
0 0
1 0
2 1
3 2
4 1
dtype: int64
Performance
df2 = pd.concat([df] * 10000, ignore_index=True)
# jezrael's solution
%%timeit
s = df2.stack()
pd.DataFrame(s.index.get_level_values(0).values,
             columns=['Labels'],
             index=s.values.astype(int)).sort_index()
4.57 ms ± 220 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
s = pd.Series(df2.index.repeat(2), index=df2.values.ravel())
s[s.index.notnull()].sort_index()
3.12 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
v = df2.index.repeat(df2.shape[1])
i = df2.values.ravel()
m = ~np.isnan(i)
s = pd.Series(v[m], index=i[m].astype(int)).sort_index()
3.1 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
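To reproduce the Labels frame shown in the question, the resulting Series can be wrapped in a one-column DataFrame (a small sketch, using the s built above):
result = s.to_frame('Labels')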
