Adding dataframes and dividing the result based on column availability - python

I want to add two dataframes, which I can achieve with the add function.
Now I want to divide each value of the resulting dataframe based on whether the respective column was present in the initial dataframes (df1, df2, ...). For example:
df1 = pd.DataFrame([[1,2],[3,4]], index =['A','B'], columns = ['C','D'])
df2 = pd.DataFrame([[11,12], [13,14]], index = ['A','B'], columns = ['D','E'])
df3 = df1.add(df2, fill_value=0)
This would result in a df like
C D E
A 1.0 13 12.0
B 3.0 17 14.0
I require a df like:
C D E
A 1.0 6.5 12.0
B 3.0 8.5 14.0
because the D column is found in both dataframes, those values are divided by 2.
Can anyone please provide a generic solution, assuming I need to add more than 2 dataframes (so the division factor also changes) and have more than 100 columns in each dataframe.

We can concatenate all DFs horizontally in one step:
In [13]: df = pd.concat([df1,df2], axis=1).fillna(0)
this yields:
In [15]: df
Out[15]:
C D D E
A 1 2 11 12
B 3 4 13 14
now we can group by columns, calculating average (mean):
In [14]: df.groupby(df.columns, axis=1).mean()
Out[14]:
C D E
A 1.0 6.5 12.0
B 3.0 8.5 14.0
or we can do it in one step (thanks @jezrael):
In [60]: pd.concat([df1,df2], axis=1).fillna(0).groupby(level=0, axis=1).mean()
Out[60]:
C D E
A 1.0 6.5 12.0
B 3.0 8.5 14.0
Timing:
In [38]: df1 = pd.concat([df1] * 10**5, ignore_index=True)
In [39]: df2 = pd.concat([df2] * 10**5, ignore_index=True)
In [40]: %%timeit
...: df = pd.concat([df1,df2], axis=1).fillna(0)
...: df.groupby(df.columns, axis=1).mean()
...:
63.4 ms ± 2.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [41]: %%timeit
...: s = pd.Series(np.concatenate([df1.columns, df2.columns])).value_counts()
...: df1.add(df2, fill_value=0).div(s)
...:
28.7 ms ± 712 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [42]: %%timeit
...: pd.concat([df1,df2]).mean(level = 0)
...:
65.5 ms ± 555 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [43]: df1.shape
Out[43]: (200000, 2)
In [44]: df2.shape
Out[44]: (200000, 2)
Current winner: @jezrael (28.7 ms ± 712 µs) - congratulations!

It looks like you are trying to compute a mean. Avoid doing many separate operations with dataframe methods on individual columns if you can help it, as it's slow.
df = pd.concat([df1,df2]) # concatenate all your dataframes together
df.mean(level = 0)
The second line computes the mean along the vertical axis (axis=0 by default), and level=0 tells pandas to take the mean for each unique index label.
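Note that the level argument to mean is no longer available in recent pandas (it was deprecated and then removed in 2.0); a minimal equivalent sketch for newer versions, assuming the same df1 and df2 as in the question, groups on the index level instead:
df = pd.concat([df1, df2])        # stack the frames vertically
df.groupby(level=0).mean()        # mean per unique index label, NaNs are skipped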

A faster solution is to divide by the number of dataframes each column occurs in:
s = pd.Series(np.concatenate([df1.columns, df2.columns])).value_counts()
print (s)
C 1
D 2
E 1
dtype: int64
df3 = df1.add(df2, fill_value=0).div(s)
print (df3)
C D E
A 1.0 6.5 12.0
B 3.0 8.5 14.0
Timings (with 100 columns, as the OP mentioned):
np.random.seed(123)
N = 100000
df1 = pd.DataFrame(np.random.randint(10, size=(N, 100)))
df1.columns = 'col' + df1.columns.astype(str)
df2 = df1.mul(10)
#MaxU solution
In [127]: %timeit (pd.concat([df1,df2], axis=1).fillna(0).groupby(level=0, axis=1).mean())
1 loop, best of 3: 952 ms per loop
#Ken Wei solution
In [128]: %timeit (pd.concat([df1,df2]).mean(level = 0))
1 loop, best of 3: 895 ms per loop
#jez solution
In [129]: %timeit (df1.add(df2, fill_value=0).div(pd.Series(np.concatenate([df1.columns, df2.columns])).value_counts()))
10 loops, best of 3: 161 ms per loop
More general solution:
If you have a list of DataFrames, it is possible to chain the additions like:
df = df1.add(df2, fill_value=0).add(df3, fill_value=0)
but it is better to use reduce:
from functools import reduce
dfs = [df1,df2, df3]
s = pd.Series(np.concatenate([x.columns for x in dfs])).value_counts()
df5 = reduce(lambda x, y: x.add(y, fill_value=0), dfs).div(s)
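For convenience, the same idea can be wrapped in a small helper; this is just a sketch, and the name mean_of_available is illustrative:
from functools import reduce
import numpy as np
import pandas as pd

def mean_of_available(dfs):
    # count in how many of the input frames each column appears
    counts = pd.Series(np.concatenate([x.columns for x in dfs])).value_counts()
    # sum all frames (missing cells treated as 0), then divide each column by its count
    return reduce(lambda x, y: x.add(y, fill_value=0), dfs).div(counts)

df5 = mean_of_available([df1, df2, df3])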

Related

Fetch the column names per row in a dataframe that are not NaN-values (Python)

I have a dataframe that has several features and a feature can have a NaN-value. E.g.
feature1 feature2 feature3 feature4
10 NaN 5 2
2 1 3 1
NaN 2 4 NaN
Note: the columns can also contain strings.
How could we get a list/array per row that contains the column names of the non-NaN values?
Thus the result array of my example would be:
res = array([feature1, feature3, feature4], [feature1, feature2, feature3, feature4],
[feature2, feature3])
To improve performance, use a list comprehension after converting the values to a NumPy array:
c = df.columns.to_numpy()
res = [c[x].tolist() for x in df.notna().to_numpy()]
print (res)
[['feature1', 'feature3', 'feature4'],
['feature1', 'feature2', 'feature3', 'feature4'],
['feature2', 'feature3']]
df = pd.concat([df] * 1000, ignore_index=True)
In [28]: %%timeit
...: out = (df.stack().reset_index().groupby('level_0')['level_1']
...: .agg(list).to_numpy().tolist()
...: )
...:
...:
96.5 ms ± 8.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [29]: %%timeit
...: c = df.columns.to_numpy()
...: res = [c[x].tolist() for x in df.notna().to_numpy()]
...:
3.36 ms ± 185 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
You can stack to keep only the non-NaN values, and aggregate into lists with groupby.agg:
out = df.stack().reset_index().groupby('level_0')['level_1'].agg(list)
Output as Series:
level_0
0 [feature1, feature3, feature4]
1 [feature1, feature2, feature3, feature4]
2 [feature2, feature3]
Name: level_1, dtype: object
As lists:
out = (df.stack().reset_index().groupby('level_0')['level_1']
.agg(list).to_numpy().tolist()
)
Output:
[['feature1', 'feature3', 'feature4'],
['feature1', 'feature2', 'feature3', 'feature4'],
['feature2', 'feature3']]

Obtain `min` and `idxmin` (or `max` and `idxmax`) at the same time ("simultaneously")?

I was wondering if there is a possibility of calling idxmin and min at the same time (in the same call/loop).
Assuming the following dataframe:
id option_1 option_2 option_3 option_4
0 0 10.0 NaN NaN 110.0
1 1 NaN 20.0 200.0 NaN
2 2 NaN 300.0 30.0 NaN
3 3 400.0 NaN NaN 40.0
4 4 600.0 700.0 50.0 50.0
I would like to calculate the minimum value (min) and the column that contains it (idxmin) of the option_ series:
id option_1 option_2 option_3 option_4 min_column min_value
0 0 10.0 NaN NaN 110.0 option_1 10.0
1 1 NaN 20.0 200.0 NaN option_2 20.0
2 2 NaN 300.0 30.0 NaN option_3 30.0
3 3 400.0 NaN NaN 40.0 option_4 40.0
4 4 600.0 700.0 50.0 50.0 option_3 50.0
Obviously, I can call idxmin and min separately (one after the other, see example below), but is there a way of making this more efficient without searching the matrix twice (once for the value and once for the index)?
An example calling min and idxmin
import pandas as pd
import numpy as np
df = pd.DataFrame({
'id': [0,1,2,3,4],
'option_1': [10, np.nan, np.nan, 400, 600],
'option_2': [np.nan, 20, 300, np.nan, 700],
'option_3': [np.nan, 200, 30, np.nan, 50],
'option_4': [110, np.nan, np.nan, 40, 50],
})
df['min_column'] = df.filter(like='option').idxmin(1)
df['min_value'] = df.filter(like='option').min(1)
(I expected this would be suboptimal as the search is performed twice.)
transpose then agg
df.set_index('id').T.agg(['min', 'idxmin']).T
min idxmin
0 10 option_1
1 20 option_2
2 30 option_3
3 40 option_4
4 50 option_3
Numpy v1
d_ = df.set_index('id')
v = d_.values
pd.DataFrame(dict(
Min=np.nanmin(v, axis=1),
Idxmin=d_.columns[np.nanargmin(v, axis=1)]
), d_.index)
Idxmin Min
id
0 option_1 10.0
1 option_2 20.0
2 option_3 30.0
3 option_4 40.0
4 option_3 50.0
Numpy v2
col_mask = df.columns.str.startswith('option')
options = df.columns[col_mask]
v = np.column_stack([*map(df.get, options)])
pd.DataFrame(dict(
Min=np.nanmin(v, axis=1),
IdxMin=options[np.nanargmin(v, axis=1)]
))
Full Simulation
Conclusion
The Numpy solutions are fastest.
Results
10 columns
pir_agg_1 pir_agg_2 pir_agg_3 wen_agg_1 tot_agg_1 tot_agg_2
10 12.465358 1.272584 1.0 5.978435 2.168994 2.164858
30 26.538924 1.305721 1.0 5.331755 2.121342 2.193279
100 80.304708 1.277684 1.0 7.221127 2.215901 2.365835
300 230.009000 1.338177 1.0 5.869560 2.505447 2.576457
1000 661.432965 1.249847 1.0 8.931438 2.940030 3.002684
3000 1757.339186 1.349861 1.0 12.541915 4.656864 4.961188
10000 3342.701758 1.724972 1.0 15.287138 6.589233 6.782102
100 columns
pir_agg_1 pir_agg_2 pir_agg_3 wen_agg_1 tot_agg_1 tot_agg_2
10 8.008895 1.000000 1.977989 5.612195 1.727308 1.769866
30 18.798077 1.000000 1.855291 4.350982 1.618649 1.699162
100 56.725786 1.000000 1.877474 6.749006 1.780816 1.850991
300 132.306699 1.000000 1.535976 7.779359 1.707254 1.721859
1000 253.771648 1.000000 1.232238 12.224478 1.855549 1.639081
3000 346.999495 2.246106 1.000000 21.114310 1.893144 1.626650
10000 431.135940 2.095874 1.000000 32.588886 2.203617 1.793076
Functions
def pir_agg_1(df):
return df.set_index('id').T.agg(['min', 'idxmin']).T
def pir_agg_2(df):
d_ = df.set_index('id')
v = d_.values
return pd.DataFrame(dict(
Min=np.nanmin(v, axis=1),
IdxMin=d_.columns[np.nanargmin(v, axis=1)]
))
def pir_agg_3(df):
col_mask = df.columns.str.startswith('option')
options = df.columns[col_mask]
v = np.column_stack([*map(df.get, options)])
return pd.DataFrame(dict(
Min=np.nanmin(v, axis=1),
IdxMin=options[np.nanargmin(v, axis=1)]
))
def wen_agg_1(df):
v = df.filter(like='option')
d = v.stack().sort_values().groupby(level=0).head(1).reset_index(level=1)
d.columns = ['IdxMin', 'Min']
return d
def tot_agg_1(df):
"""I combined toto_tico's 2 filter calls into one"""
d = df.filter(like='option')
return df.assign(
IdxMin=d.idxmin(1),
Min=d.min(1)
)
def tot_agg_2(df):
d = df.filter(like='option')
idxmin = d.idxmin(1)
return df.assign(
IdxMin=idxmin,
Min=d.lookup(d.index, idxmin)
)
Sim setup
def sim_df(n, m):
return pd.DataFrame(
np.random.randint(m, size=(n, m))
).rename_axis('id').add_prefix('option').reset_index()
fs = 'pir_agg_1 pir_agg_2 pir_agg_3 wen_agg_1 tot_agg_1 tot_agg_2'.split()
ix = [10, 30, 100, 300, 1000, 3000, 10000]
res_small_col = pd.DataFrame(index=ix, columns=fs, dtype=float)
res_large_col = pd.DataFrame(index=ix, columns=fs, dtype=float)
for i in ix:
df = sim_df(i, 10)
for j in fs:
stmt = f"{j}(df)"
setp = f"from __main__ import {j}, df"
res_small_col.at[i, j] = timeit(stmt, setp, number=10)
for i in ix:
df = sim_df(i, 100)
for j in fs:
stmt = f"{j}(df)"
setp = f"from __main__ import {j}, df"
res_large_col.at[i, j] = timeit(stmt, setp, number=10)
Maybe use stack with groupby:
v=df.filter(like='option')
v.stack().sort_values().groupby(level=[0]).head(1).reset_index(level=1)
Out[313]:
level_1 0
0 option_1 10.0
1 option_2 20.0
2 option_3 30.0
3 option_4 40.0
4 option_3 50.0
UPDATE 2:
The numpy solution of @piRSquared is the winner for what I would consider the most common cases. Here is his answer with a minimal modification to assign the columns to the original dataframe (which I did in all my tests, in order to be consistent with the example in the original question):
col_mask = df.columns.str.startswith('option')
options = df.columns[col_mask]
v = np.column_stack([*map(df.get, options)])
df.assign(min_value = np.nanmin(v, axis=1),
min_column = options[np.nanargmin(v, axis=1)])
You should be careful if you have a lot of columns (more than 10000), since in these extreme cases the results could start changing significantly.
UPDATE 1:
According to my tests, calling min and idxmin separately is the fastest you can do, based on all the proposed answers.
Although it is not at the same time (see the direct answer below), you would be better off using DataFrame.lookup on the column indexes (the min_column column), in order to avoid the search for the values (min_value).
So, instead of traversing the entire matrix - which is O(n*m), you would only traverse the resulting min_column series - which is O(n):
df = pd.DataFrame({
'id': [0,1,2,3,4],
'option_1': [10, np.nan, np.nan, 400, 600],
'option_2': [np.nan, 20, 300, np.nan, 700],
'option_3': [np.nan, 200, 30, np.nan, 50],
'option_4': [110, np.nan, np.nan, 40, 50],
})
df['min_column'] = df.filter(like='option').idxmin(1)
df['min_value'] = df.lookup(df.index, df['min_column'])
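Side note: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0; a rough equivalent sketch using NumPy indexing (reusing df and min_column from above) would be:
options = df.filter(like='option')
pos = options.columns.get_indexer(df['min_column'])            # column position for each row
df['min_value'] = options.to_numpy()[np.arange(len(df)), pos]  # pick one value per row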
Direct answer (not as efficient)
Since you asked about how to calculate the values "in the same call" (let's say because you simplified your example for the question), you can try a lambda expression:
def min_idxmin(x):
_idx = x.idxmin()
return _idx, x[_idx]
df['min_column'], df['min_value'] = zip(*df.filter(like='option').apply(
lambda x: min_idxmin(x), axis=1))
To be clear, although here the 2nd search is removed (replaced by a direct access in x[_idx]), this will very likely take much longer because you are not exploiting the vectorized operations of pandas/numpy.
Bottom line is pandas/numpy vectorized operations are very fast.
Summary of the summary:
There doesn't seem to be any advantage in using df.lookup; calling min and idxmin separately is better than using the lookup, which is mind-blowing and deserves a question in itself.
Summary of the timings:
I tested a dataframe with 10000 rows and 10 columns (the option_ columns of the initial example). Since I got a couple of unexpected results, I then also tested with 1000x1000 and 100x10000. According to the results:
Using numpy as @piRSquared suggested (test8) is the clear winner; it only starts performing worse when there are a lot of columns (the 100x10000 case, an extreme that does not change the general recommendation). test9 modifies it to use indexing in numpy instead of a second search, but generally speaking it performs worse.
Calling min and idxmin separately was the best for the 10000x10 case, even better than DataFrame.lookup (although DataFrame.lookup performed better in the 100x10000 case). Although the shape of the data influences the results, I would argue that having 10000 columns is a bit unrealistic.
The solution provided by @Wen followed in performance, though it was not better than calling idxmin and min separately, or using DataFrame.lookup. I did an extra test (see test7()) because I felt that the additional operations (reset_index and zip) might be skewing the result. It was still worse than test1 and test2, even though it does not do the assignment (I couldn't figure out how to make the assignment using the head(1)). @Wen, would you mind giving me a hand?
@Wen's solution underperforms when there are more columns (1000x1000 or 100x10000), which makes sense because sorting is slower than searching. In this case, the lambda expression that I suggested performs better.
Any other solution that uses a lambda expression or the transpose (T) falls behind. The lambda expression that I suggested took around 1 second, better than the ~11 secs using the transpose T suggested by @piRSquared and @RafaelC.
TimeIt results with 10000 rows x 10 columns (pandas 0.23.4):
Using the following dataframe of 10000 rows and 10 columns:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(10000, 10)), columns=[f'option_{x}' for x in range(1,11)]).reset_index()
Calling the two columns twice separatedly:
def test1():
df['min_column'] = df.filter(like='option').idxmin(1)
df['min_value'] = df.filter(like='option').min(1)
%timeit -n 100 test1()
13 ms ± 580 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Calling the lookup (it is slower for this case!):
def test2():
df['min_column'] = df.filter(like='option').idxmin(1)
df['min_value'] = df.lookup(df.index, df['min_column'])
%timeit -n 100 test2()
# 15.7 ms ± 399 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using apply and min_idxmin(x):
def min_idxmin(x):
_idx = x.idxmin()
return _idx, x[_idx]
def test3():
df['min_column'], df['min_value'] = zip(*df.filter(like='option').apply(
lambda x: min_idxmin(x), axis=1))
%timeit -n 10 test3()
# 968 ms ± 32.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using agg['min', 'idxmin'] by @piRSquared:
def test4():
df['min_column'], df['min_value'] = zip(*df.set_index('index').filter(like='option').T.agg(['min', 'idxmin']).T.values)
%timeit -n 1 test4()
# 11.2 s ± 850 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using agg['min', 'idxmin'] by @RafaelC:
def test5():
df['min_column'], df['min_value'] = zip(*df.filter(like='option').agg(lambda x: x.agg(['min', 'idxmin']), axis=1).values)
%timeit -n 1 test5()
# 11.7 s ± 597 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Sorting values by @Wen:
def test6():
df['min_column'], df['min_value'] = zip(*df.filter(like='option').stack().sort_values().groupby(level=[0]).head(1).reset_index(level=1).values)
%timeit -n 100 test6()
# 33.6 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
Sorting values by @Wen, modified by me to make the comparison fairer by removing the overhead of the assignment operation (I explained why in the summary at the beginning):
def test7():
df.filter(like='option').stack().sort_values().groupby(level=[0]).head(1)
%timeit -n 100 test7()
# 25 ms ± 937 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using numpy:
def test8():
col_mask = df.columns.str.startswith('option')
options = df.columns[col_mask]
v = np.column_stack([*map(df.get, options)])
df.assign(min_value = np.nanmin(v, axis=1),
min_column = options[np.nanargmin(v, axis=1)])
%timeit -n 100 test8()
# 2.76 ms ± 248 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using numpy but avoid the search (indexing instead):
def test9():
col_mask = df.columns.str.startswith('option')
options = df.columns[col_mask]
v = np.column_stack([*map(df.get, options)])
idxmin = np.nanargmin(v, axis=1)
# instead of looking for the answer, indexes are used
df.assign(min_value = v[range(v.shape[0]), idxmin],
min_column = options[idxmin])
%timeit -n 100 test9()
# 3.96 ms ± 267 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
TimeIt results with 1000 rows x 1000 columns:
I performed more tests with a 1000x1000 shape:
df = pd.DataFrame(np.random.randint(0,100,size=(1000, 1000)), columns=[f'option_{x}' for x in range(1,1001)]).reset_index()
The results change:
test1 ~27.6ms
test2 ~29.4ms
test3 ~135ms
test4 ~1.18s
test5 ~1.29s
test6 ~287ms
test7 ~290ms
test8 ~25.7ms
test9 ~26.1ms
TimeIt results with 100 rows x 10000 columns:
I performed more tests with a 100x10000 shape:
df = pd.DataFrame(np.random.randint(0,100,size=(100, 10000)), columns=[f'option_{x}' for x in range(1,10001)]).reset_index()
The results change:
test1 ~46.8ms
test2 ~25.6ms
test3 ~101ms
test4 ~289ms
test5 ~276ms
test6 ~349ms
test7 ~301ms
test8 ~121ms
test9 ~122ms

fastest way to change multiple loc in a dataframe

I have a pandas dataframe with 1 million rows. I want to replace the values in 900,000 rows of a column with another set of values. Is there a fast way to do this without a for loop (which takes me two days to complete)?
For example, look at this sample dataframe where I have condensed 1 million rows to 8 rows
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['a'] = [-1,-3,-4,-4,-3, 4,5,6]
df['b'] = [23,45,67,89,0,-1, 2, 3]
L2 = [-1,-3,-4]
L5 = [9,10,11]
I want to replace the values where a is -1, -3, or -4 in a single shot if possible, or as fast as possible, without a for loop.
The crucial part is that the values in L5 have to be repeated as needed.
I have tried
df.loc[df.a < 0, 'a'] = L5
but this works only when len(df.a.values) == len(L5)
Use map with a dictionary created from both lists by zip, then restore the original values for non-matched entries with fillna:
d = dict(zip(L2, L5))
print (d)
{-1: 9, -3: 10, -4: 11}
df['a'] = df['a'].map(d).fillna(df['a'])
print (df)
a b
0 9.0 23
1 10.0 45
2 11.0 67
3 11.0 89
4 10.0 0
5 4.0 -1
6 5.0 2
7 6.0 3
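Because map yields NaN for non-matched values and fillna then produces a float column, you may want to cast back to integers afterwards; a possible follow-up (assuming the column has no genuine NaNs) is:
df['a'] = df['a'].map(d).fillna(df['a']).astype(int)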
Performance:
It depends on the number of values to replace and on the length of the lists:
If the length of the lists is 100:
np.random.seed(123)
N = 1000000
df = pd.DataFrame({'a':np.random.randint(1000, size=N)})
L2 = np.arange(100)
L5 = np.arange(100) + 10
In [336]: %timeit df['d'] = np.select([df['a'] == i for i in L2], L5, df['a'])
180 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [337]: %timeit df['a'].map(dict(zip(L2, L5))).fillna(df['a'])
56.9 ms ± 2.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
If the length of the lists is small (e.g. 3):
np.random.seed(123)
N = 1000000
df = pd.DataFrame({'a':np.random.randint(100, size=N)})
L2 = np.arange(3)
L5 = np.arange(3) + 10
In [339]: %timeit df['d'] = np.select([df['a'] == i for i in L2], L5, df['a'])
11.9 ms ± 40.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [340]: %timeit df['a'].map(dict(zip(L2, L5))).fillna(df['a'])
54 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
You can use np.select, such as:
import numpy as np
condition = [df['a'] == i for i in L2]
df['a'] = np.select(condition, L5, df['a'])
and you get:
a b
0 9 23
1 10 45
2 11 67
3 11 89
4 10 0
5 4 -1
6 5 2
7 6 3
Timing: let's create a bigger dataframe from your df:
df_l = pd.concat([df]*10000)
print (df_l.shape)
(80000, 2)
Now some timeit:
# with map, @jezrael
d = dict(zip(L2, L5))
%timeit df_l['a'].map(d).fillna(df_l['a'])
100 loops, best of 3: 7.71 ms per loop
# with np.select
condition = [df_l['a'] == i for i in L2]
%timeit np.select(condition, L5, df_l['a'])
1000 loops, best of 3: 350 µs per loop

Efficient and fastest way in Pandas to create sorted list from column values

Given a dataframe
A B C
3 1 2
2 1 3
3 2 1
I would like to get a new column with the column names sorted by each row's values:
A B C new_col
3 1 2 [B,C,A]
2 1 3 [B,A,C]
3 2 1 [C,B,A]
This is my code. It works but is quite slow.
import operator

col_list = df.columns  # columns to sort by value

def blist(x):
    col_dict = {}
    for col in col_list:
        col_dict[col] = x[col]
    sorted_tuple = sorted(col_dict.items(), key=operator.itemgetter(1))
    return [i[0] for i in sorted_tuple]

df['new_col'] = df.apply(blist, axis=1)
I would appreciate a better approach to solving this problem.
Try to use np.argsort() in conjunction with np.take():
In [132]: df['new_col'] = np.take(df.columns, np.argsort(df)).tolist()
In [133]: df
Out[133]:
A B C new_col
0 3 1 2 [B, C, A]
1 2 1 3 [B, A, C]
2 3 2 1 [C, B, A]
Timing for a 30,000-row DF:
In [182]: df = pd.concat([df] * 10**4, ignore_index=True)
In [183]: df.shape
Out[183]: (30000, 3)
In [184]: %timeit df.apply(blist,axis=1)
4.84 s ± 31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [185]: %timeit np.take(df.columns, np.argsort(df)).tolist()
5.45 ms ± 26.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Ratio:
In [187]: (4.84*1000)/5.45
Out[187]: 888.0733944954128

Get first letter of a string from column

I'm fighting with pandas and for now I'm losing. I have a source table similar to this:
import pandas as pd
a=pd.Series([123,22,32,453,45,453,56])
b=pd.Series([234,4353,355,453,345,453,56])
df=pd.concat([a, b], axis=1)
df.columns=['First', 'Second']
I would like to add a new column to this data frame with the first digit of the values in column 'First':
a) change the numbers in column 'First' to strings
b) extract the first character from the newly created string
c) save the results from b as a new column in the data frame
I don't know how to apply this to the pandas data frame object. I would be grateful for help with that.
Cast the dtype of the column to str and you can perform vectorised slicing by calling .str:
In [29]:
df['new_col'] = df['First'].astype(str).str[0]
df
Out[29]:
First Second new_col
0 123 234 1
1 22 4353 2
2 32 355 3
3 453 453 4
4 45 345 4
5 453 453 4
6 56 56 5
If you need to, you can cast the dtype back again by calling astype(int) on the column.
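For example, a small sketch of that cast back (reusing new_col from above):
df['new_col'] = df['new_col'].astype(int)   # '1' -> 1, '2' -> 2, ...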
.str.get
This is the simplest of the string methods to use here:
# Setup
df = pd.DataFrame({'A': ['xyz', 'abc', 'foobar'], 'B': [123, 456, 789]})
df
A B
0 xyz 123
1 abc 456
2 foobar 789
df.dtypes
A object
B int64
dtype: object
For string (read:object) type columns, use
df['C'] = df['A'].str[0]
# Similar to,
df['C'] = df['A'].str.get(0)
.str handles NaNs by returning NaN as the output.
For numeric (non-string) columns, an .astype(str) conversion is required beforehand, as shown in @Ed Chum's answer above.
# Note that this won't work well if the data has NaNs.
# It'll return lowercase "n"
df['D'] = df['B'].astype(str).str[0]
df
A B C D
0 xyz 123 x 1
1 abc 456 a 4
2 foobar 789 f 7
List Comprehension and Indexing
There is enough evidence to suggest a simple list comprehension will work well here and probably be faster.
# For string columns
df['C'] = [x[0] for x in df['A']]
# For numeric columns
df['D'] = [str(x)[0] for x in df['B']]
df
A B C D
0 xyz 123 x 1
1 abc 456 a 4
2 foobar 789 f 7
If your data has NaNs, then you will need to handle this appropriately with an if/else in the list comprehension,
df2 = pd.DataFrame({'A': ['xyz', np.nan, 'foobar'], 'B': [123, 456, np.nan]})
df2
A B
0 xyz 123.0
1 NaN 456.0
2 foobar NaN
# For string columns
df2['C'] = [x[0] if isinstance(x, str) else np.nan for x in df2['A']]
# For numeric columns
df2['D'] = [str(x)[0] if pd.notna(x) else np.nan for x in df2['B']]
A B C D
0 xyz 123.0 x 1
1 NaN 456.0 NaN 4
2 foobar NaN f NaN
Let's do some timeit tests on some larger data.
df_ = df.copy()
df = pd.concat([df_] * 5000, ignore_index=True)
%timeit df.assign(C=df['A'].str[0])
%timeit df.assign(D=df['B'].astype(str).str[0])
%timeit df.assign(C=[x[0] for x in df['A']])
%timeit df.assign(D=[str(x)[0] for x in df['B']])
12 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
27.1 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
3.77 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.84 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
List comprehensions are 4x faster.
