Convert whole dataframe from lower case to upper case with Pandas - python

I have a dataframe like the one displayed below:
# Create an example dataframe about a fictional army
import pandas as pd

raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks'],
            'company': ['1st', '1st', '2nd', '2nd'],
            'deaths': ['kkk', 52, '25', 616],
            'battles': [5, '42', 2, 2],
            'size': ['l', 'll', 'l', 'm']}
df = pd.DataFrame(raw_data, columns=['regiment', 'company', 'deaths', 'battles', 'size'])
My goal is to transform every single string inside the dataframe to upper case.
Notice: all data types are objects and must not be changed; the output must still contain only object columns. I want to avoid converting every single column one by one; I would like to do it generally over the whole dataframe if possible.
What I tried so far is this, but without success:
df.str.upper()

astype(str) casts each Series to the object dtype (strings); the .str accessor then exposes string methods such as upper() on every element. Note that after this, the dtype of all columns changes to object.
In [17]: df
Out[17]:
regiment company deaths battles size
0 Nighthawks 1st kkk 5 l
1 Nighthawks 1st 52 42 ll
2 Nighthawks 2nd 25 2 l
3 Nighthawks 2nd 616 2 m
In [18]: df.apply(lambda x: x.astype(str).str.upper())
Out[18]:
regiment company deaths battles size
0 NIGHTHAWKS 1ST KKK 5 L
1 NIGHTHAWKS 1ST 52 42 LL
2 NIGHTHAWKS 2ND 25 2 L
3 NIGHTHAWKS 2ND 616 2 M
You can later convert the 'battles' column back to numeric using to_numeric():
In [42]: df2 = df.apply(lambda x: x.astype(str).str.upper())
In [43]: df2['battles'] = pd.to_numeric(df2['battles'])
In [44]: df2
Out[44]:
regiment company deaths battles size
0 NIGHTHAWKS 1ST KKK 5 L
1 NIGHTHAWKS 1ST 52 42 LL
2 NIGHTHAWKS 2ND 25 2 L
3 NIGHTHAWKS 2ND 616 2 M
In [45]: df2.dtypes
Out[45]:
regiment object
company object
deaths object
battles int64
size object
dtype: object
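The 'deaths' column can be treated the same way; passing errors='coerce' (a sketch) turns the non-numeric 'KKK' into NaN instead of raising:
df2['deaths'] = pd.to_numeric(df2['deaths'], errors='coerce')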

This can be solved with the applymap method, which applies a function element-wise and, unlike the astype(str) approach, preserves the original dtypes:
df = df.applymap(lambda s: s.upper() if isinstance(s, str) else s)
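A quick check (a sketch using the mixed deaths and size columns from the question) shows applymap touching only the strings, so non-string values and dtypes survive. Note that pandas 2.1+ deprecates applymap in favour of the equivalent DataFrame.map:
import pandas as pd

df = pd.DataFrame({'deaths': ['kkk', 52, '25', 616], 'size': ['l', 'll', 'l', 'm']})
out = df.applymap(lambda s: s.upper() if isinstance(s, str) else s)
print(out)          # strings upper-cased, the ints 52 and 616 untouched
print(out.dtypes)   # both columns stay object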

Instead of applying a function to each cell of each row, get the column names in a list and then loop over that list, converting each column with the vectorized .str accessor. The code below uses vectorized operations, which are faster than the apply function:
for col in df.columns:
    df[col] = df[col].astype(str).str.upper()

Since .str only works on a Series, you can apply it to each column individually and then concatenate:
In [6]: pd.concat([df[col].astype(str).str.upper() for col in df.columns], axis=1)
Out[6]:
regiment company deaths battles size
0 NIGHTHAWKS 1ST KKK 5 L
1 NIGHTHAWKS 1ST 52 42 LL
2 NIGHTHAWKS 2ND 25 2 L
3 NIGHTHAWKS 2ND 616 2 M
Edit: performance comparison
In [10]: %timeit df.apply(lambda x: x.astype(str).str.upper())
100 loops, best of 3: 3.32 ms per loop
In [11]: %timeit pd.concat([df[col].astype(str).str.upper() for col in df.columns], axis=1)
100 loops, best of 3: 3.32 ms per loop
Both answers perform equally on a small dataframe.
In [15]: df = pd.concat(10000 * [df])
In [16]: %timeit pd.concat([df[col].astype(str).str.upper() for col in df.columns], axis=1)
10 loops, best of 3: 104 ms per loop
In [17]: %timeit df.apply(lambda x: x.astype(str).str.upper())
10 loops, best of 3: 130 ms per loop
On a large dataframe my answer is slightly faster.

Try this:
df2 = df2.apply(lambda x: x.str.upper() if x.dtype == "object" else x)

If you want to conserve the dtypes of non-string columns, check the column dtype rather than using isinstance (every Series is an instance of object, so isinstance(x, object) is always True):
df.apply(lambda x: x.str.upper().str.strip() if x.dtype == 'object' else x)

If you want to conserve dtypes, or to change only the object columns, try a for loop with an if:
for x in dados.columns:
    if dados[x].dtype == 'object':
        print('object - apply upper')
        dados[x] = dados[x].str.upper()
    else:
        print('other - leave as is')

You can also apply it to the column labels (note this transforms the names, not the cell values):
oh_df.columns = map(str.lower, oh_df.columns)
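Equivalently (a sketch), the column Index has its own .str accessor:
oh_df.columns = oh_df.columns.str.lower()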

Related

How can I get one values that exist by column in dataframe? [duplicate]

If I've got a DataFrame in pandas which looks something like:
A B C
0 1 NaN 2
1 NaN 3 NaN
2 NaN 4 5
3 NaN NaN NaN
How can I get the first non-null value from each row? E.g. for the above, I'd like to get: [1, 3, 4, None] (or equivalent Series).
Backfill the NaNs along each row with fillna, then take the leftmost column:
df.fillna(method='bfill', axis=1).iloc[:, 0]
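Note: fillna(method='bfill') is deprecated in recent pandas versions; the equivalent spelling (same semantics) is:
df.bfill(axis=1).iloc[:, 0]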
This is a really messy way to do it: first use first_valid_index to get the valid columns, convert the returned Series to a DataFrame so we can call apply row-wise, and use this to index back into the original df:
In [160]:
def func(x):
    if x.values[0] is None:
        return None
    else:
        return df.loc[x.name, x.values[0]]
pd.DataFrame(df.apply(lambda x: x.first_valid_index(), axis=1)).apply(func, axis=1)
Out[160]:
0 1
1 3
2 4
3 NaN
dtype: float64
EDIT
A slightly cleaner way:
In [12]:
def func(x):
    if x.first_valid_index() is None:
        return None
    else:
        return x[x.first_valid_index()]
df.apply(func, axis=1)
Out[12]:
0 1
1 3
2 4
3 NaN
dtype: float64
I'm going to weigh in here as I think this is a good deal faster than any of the proposed methods. argmin gives the index of the first False value in each row of the result of np.isnan in a vectorized way, which is the hard part. It still relies on a Python loop to extract the values, but the lookup is very quick:
def get_first_non_null(df):
    a = df.values
    col_index = np.isnan(a).argmin(axis=1)
    return [a[row, col] for row, col in enumerate(col_index)]
EDIT:
Here's a fully vectorized solution, which can be a good deal faster again depending on the shape of the input. Updated benchmarking below.
def get_first_non_null_vec(df):
    a = df.values
    n_rows, n_cols = a.shape
    col_index = np.isnan(a).argmin(axis=1)
    flat_index = n_cols * np.arange(n_rows) + col_index
    return a.ravel()[flat_index]
If a row is completely null then the corresponding value will be null also.
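Usage on the question's frame (a sketch):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, np.nan, np.nan],
                   'B': [np.nan, 3, 4, np.nan],
                   'C': [2, np.nan, 5, np.nan]})
get_first_non_null_vec(df)   # array([ 1.,  3.,  4., nan])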
Here's some benchmarking against unutbu's solution:
df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 1500), p=(0.01, 0.99)))
#%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 220 ms per loop
100 loops, best of 3: 16.2 ms per loop
100 loops, best of 3: 12.6 ms per loop
In [109]:
df = pd.DataFrame(np.random.choice([1, np.nan], (100000, 150), p=(0.01, 0.99)))
#%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 246 ms per loop
10 loops, best of 3: 48.2 ms per loop
100 loops, best of 3: 15.7 ms per loop
df = pd.DataFrame(np.random.choice([1, np.nan], (1000000, 15), p=(0.01, 0.99)))
%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 326 ms per loop
1 loops, best of 3: 326 ms per loop
10 loops, best of 3: 35.7 ms per loop
Here is another way to do it:
In [183]: df.stack().groupby(level=0).first().reindex(df.index)
Out[183]:
0 1
1 3
2 4
3 NaN
dtype: float64
The idea here is to use stack to move the columns into a row index level:
In [184]: df.stack()
Out[184]:
0 A 1
C 2
1 B 3
2 B 4
C 5
dtype: float64
Now, if you group by the first row level -- i.e. the original index -- and take the first value from each group, you essentially get the desired result:
In [185]: df.stack().groupby(level=0).first()
Out[185]:
0 1
1 3
2 4
dtype: float64
All we need to do is reindex the result (using the original index) so as to
include rows that are completely NaN:
df.stack().groupby(level=0).first().reindex(df.index)
This is nothing new, but it's a combination of the best bits of @yangie's approach with a list comprehension, and @EdChum's df.apply approach that I think is easiest to understand.
First, which columns do we want to pick our values from?
In [95]: pick_cols = df.apply(pd.Series.first_valid_index, axis=1)
In [96]: pick_cols
Out[96]:
0 A
1 B
2 B
3 None
dtype: object
Now how do we pick the values?
In [100]: [df.loc[k, v] if v is not None else None
....: for k, v in pick_cols.items()]
Out[100]: [1.0, 3.0, 4.0, None]
This is ok, but we really want the index to match that of the original DataFrame:
In [98]: pd.Series({k: df.loc[k, v] if v is not None else None
....: for k, v in pick_cols.items()})
Out[98]:
0 1
1 3
2 4
3 NaN
dtype: float64
groupby with axis=1
If we pass a callable that returns the same value for every column, we group all columns together. This lets us use groupby.first, which makes this easy:
df.groupby(lambda x: 'Z', axis=1).first()
Z
0 1.0
1 3.0
2 4.0
3 NaN
This returns a dataframe whose single column is named after the value returned by the callable.
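Note: grouping along axis=1 is deprecated in recent pandas versions; a sketch of the same idea that transposes instead:
df.T.groupby(lambda x: 'Z').first().T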
lookup, notna, and idxmax
df.lookup(df.index, df.notna().idxmax(1))
array([ 1., 3., 4., nan])
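Note: DataFrame.lookup was removed in pandas 2.0. A sketch of an equivalent on current versions, reusing the notna/idxmax idea:
import numpy as np
idx = df.notna().idxmax(axis=1)
df.to_numpy()[np.arange(len(df)), df.columns.get_indexer(idx)]
# array([ 1., 3., 4., nan])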
argmin and slicing
v = df.values
v[np.arange(len(df)), np.isnan(v).argmin(1)]
array([ 1., 3., 4., nan])
Here is a one-line solution:
[row[row.first_valid_index()] if row.first_valid_index() is not None else None for _, row in df.iterrows()]
Edit:
This solution iterates over the rows of df. row.first_valid_index() returns the label of the first non-NA/null value, which is used as an index to get the first non-null item in each row.
If there is no non-null value in the row, row.first_valid_index() is None; since None cannot be used as an index (and a plain truthiness test would also misfire on falsy labels), I need an explicit if-else.
I packed everything into a list comprehension for brevity.
JoeCondron's answer (EDIT: before his last edit!) is cool but there is margin for significant improvement by avoiding the non-vectorized enumeration:
def get_first_non_null_vect(df):
    a = df.values
    col_index = np.isnan(a).argmin(axis=1)
    return a[np.arange(a.shape[0]), col_index]
The improvement is small if the DataFrame is relatively flat:
In [4]: df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 1500), p=(0.01, 0.99)))
In [5]: %timeit get_first_non_null(df)
10 loops, best of 3: 34.9 ms per loop
In [6]: %timeit get_first_non_null_vect(df)
10 loops, best of 3: 31.6 ms per loop
... but can be relevant on slim DataFrames:
In [7]: df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 15), p=(0.1, 0.9)))
In [8]: %timeit get_first_non_null(df)
100 loops, best of 3: 3.75 ms per loop
In [9]: %timeit get_first_non_null_vect(df)
1000 loops, best of 3: 718 µs per loop
Compared to JoeCondron's vectorized version, the runtime is very similar (this is still slightly quicker for slim DataFrames, and slightly slower for large ones).
import numpy
import pandas
df = pandas.DataFrame({'A': [1, numpy.nan, numpy.nan, numpy.nan], 'B': [numpy.nan, 3, 4, numpy.nan], 'C': [2, numpy.nan, 5, numpy.nan]})
df
A B C
0 1.0 NaN 2.0
1 NaN 3.0 NaN
2 NaN 4.0 5.0
3 NaN NaN NaN
df.apply(lambda x: numpy.nan if all(x.isnull()) else x[x.first_valid_index()], axis=1).tolist()
[1.0, 3.0, 4.0, nan]

how to assign serial number to dates [duplicate]

Suppose I have a df which has columns of 'ID', 'col_1', 'col_2'. And I define a function :
f = lambda x, y : my_function_expression.
Now I want to apply f to df's two columns 'col_1' and 'col_2' to element-wise calculate a new column 'col_3', somewhat like:
df['col_3'] = df[['col_1','col_2']].apply(f)
# Pandas gives : TypeError: ('<lambda>() takes exactly 2 arguments (1 given)'
How can I do this?
*** Added a detailed sample below ***
import pandas as pd
df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']
def get_sublist(sta, end):
    return mylist[sta:end+1]
#df['col_3'] = df[['col_1','col_2']].apply(get_sublist,axis=1)
# expect above to output df as below
ID col_1 col_2 col_3
0 1 0 1 ['a', 'b']
1 2 2 4 ['c', 'd', 'e']
2 3 3 5 ['d', 'e', 'f']
There is a clean, one-line way of doing this in Pandas:
df['col_3'] = df.apply(lambda x: f(x.col_1, x.col_2), axis=1)
This allows f to be a user-defined function with multiple input values, and uses (safe) column names rather than (unsafe) numeric indices to access the columns.
Example with data (based on original question):
import pandas as pd
df = pd.DataFrame({'ID':['1', '2', '3'], 'col_1': [0, 2, 3], 'col_2':[1, 4, 5]})
mylist = ['a', 'b', 'c', 'd', 'e', 'f']
def get_sublist(sta, end):
    return mylist[sta:end+1]
df['col_3'] = df.apply(lambda x: get_sublist(x.col_1, x.col_2), axis=1)
Output of print(df):
ID col_1 col_2 col_3
0 1 0 1 [a, b]
1 2 2 4 [c, d, e]
2 3 3 5 [d, e, f]
If your column names contain spaces or share a name with an existing dataframe attribute, you can index with square brackets:
df['col_3'] = df.apply(lambda x: f(x['col 1'], x['col 2']), axis=1)
Here's an example using apply on the dataframe, which I am calling with axis = 1.
Note the difference is that instead of trying to pass two values to the function f, rewrite the function to accept a pandas Series object, and then index the Series to get the values needed.
In [49]: df
Out[49]:
0 1
0 1.000000 0.000000
1 -0.494375 0.570994
2 1.000000 0.000000
3 1.876360 -0.229738
4 1.000000 0.000000
In [50]: def f(x):
....: return x[0] + x[1]
....:
In [51]: df.apply(f, axis=1) #passes a Series object, row-wise
Out[51]:
0 1.000000
1 0.076619
2 1.000000
3 1.646622
4 1.000000
Depending on your use case, it is sometimes helpful to create a pandas group object, and then use apply on the group.
A simple solution is:
df['col_3'] = df[['col_1','col_2']].apply(lambda x: f(*x), axis=1)
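A quick check with the question's data (a sketch; result_type='reduce' guards against the list-expansion pitfall described later in this thread):
df['col_3'] = df[['col_1', 'col_2']].apply(lambda x: get_sublist(*x), axis=1, result_type='reduce')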
An interesting question! My answer is below:
import pandas as pd
def sublst(row):
    return lst[row['J1']:row['J2']]
df = pd.DataFrame({'ID': ['1','2','3'], 'J1': [0,2,3], 'J2': [1,4,5]})
print(df)
lst = ['a','b','c','d','e','f']
df['J3'] = df.apply(sublst, axis=1)
print(df)
Output:
ID J1 J2
0 1 0 1
1 2 2 4
2 3 3 5
ID J1 J2 J3
0 1 0 1 [a]
1 2 2 4 [c, d]
2 3 3 5 [d, e]
I changed the column names to ID, J1, J2, J3 to ensure ID < J1 < J2 < J3, so the columns display in the right sequence.
One more brief version:
import pandas as pd
df = pd.DataFrame({'ID': ['1','2','3'], 'J1': [0,2,3], 'J2': [1,4,5]})
print(df)
lst = ['a','b','c','d','e','f']
df['J3'] = df.apply(lambda row: lst[row['J1']:row['J2']], axis=1)
print(df)
The method you are looking for is Series.combine.
However, it seems some care has to be taken around datatypes.
In your example, you would (as I did when testing the answer) naively call
df['col_3'] = df.col_1.combine(df.col_2, func=get_sublist)
However, this throws the error:
ValueError: setting an array element with a sequence.
My best guess is that it seems to expect the result to be of the same type as the series calling the method (df.col_1 here). However, the following works:
df['col_3'] = df.col_1.astype(object).combine(df.col_2, func=get_sublist)
df
ID col_1 col_2 col_3
0 1 0 1 [a, b]
1 2 2 4 [c, d, e]
2 3 3 5 [d, e, f]
Returning a list from apply is a dangerous operation as the resulting object is not guaranteed to be either a Series or a DataFrame. And exceptions might be raised in certain cases. Let's walk through a simple example:
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.random.randint(0, 5, (5, 3)),
                  columns=['a', 'b', 'c'])
df
a b c
0 4 0 0
1 2 0 1
2 2 2 2
3 1 2 2
4 3 0 0
There are three possible outcomes when returning a list from apply:
1) If the length of the returned list is not equal to the number of columns, then a Series of lists is returned.
df.apply(lambda x: list(range(2)), axis=1) # returns a Series
0 [0, 1]
1 [0, 1]
2 [0, 1]
3 [0, 1]
4 [0, 1]
dtype: object
2) When the length of the returned list is equal to the number of
columns then a DataFrame is returned and each column gets the
corresponding value in the list.
df.apply(lambda x: list(range(3)), axis=1) # returns a DataFrame
a b c
0 0 1 2
1 0 1 2
2 0 1 2
3 0 1 2
4 0 1 2
3) If the length of the returned list equals the number of columns for the first row, but at least one other row returns a list with a different number of elements, a ValueError is raised.
i = 0
def f(x):
    global i
    if i == 0:
        i += 1
        return list(range(3))
    return list(range(4))
df.apply(f, axis=1)
ValueError: Shape of passed values is (5, 4), indices imply (5, 3)
Answering the problem without apply
Using apply with axis=1 is very slow. It is possible to get much better performance (especially on larger datasets) with basic iterative methods.
Create larger dataframe
df1 = df.sample(100000, replace=True).reset_index(drop=True)
Timings
# apply is slow with axis=1
%timeit df1.apply(lambda x: mylist[x['col_1']: x['col_2']+1], axis=1)
2.59 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# zip - similar to @Thomas
%timeit [mylist[v1:v2+1] for v1, v2 in zip(df1.col_1, df1.col_2)]
29.5 ms ± 534 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# @Thomas's answer
%timeit list(map(get_sublist, df1['col_1'],df1['col_2']))
34 ms ± 459 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I'm sure this isn't as fast as the solutions using Pandas or Numpy operations, but if you don't want to rewrite your function you can use map. Using the original example data -
import pandas as pd
df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']
def get_sublist(sta, end):
    return mylist[sta:end+1]
df['col_3'] = list(map(get_sublist,df['col_1'],df['col_2']))
# In Python 2, don't convert the map above to a list
We could pass as many arguments as we wanted into the function this way. The output is what we wanted
ID col_1 col_2 col_3
0 1 0 1 [a, b]
1 2 2 4 [c, d, e]
2 3 3 5 [d, e, f]
I'm going to put in a vote for np.vectorize. It allows you to just shoot over any number of columns and not deal with the dataframe in the function, so it's great for functions you don't control, or for doing something like sending two columns and a constant into a function (e.g. col_1, col_2, 'foo').
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']
def get_sublist(sta, end):
    return mylist[sta:end+1]
#df['col_3'] = df[['col_1','col_2']].apply(get_sublist,axis=1)
# expect above to output df as below
df.loc[:, 'col_3'] = np.vectorize(get_sublist, otypes=["O"])(df['col_1'], df['col_2'])
df
ID col_1 col_2 col_3
0 1 0 1 [a, b]
1 2 2 4 [c, d, e]
2 3 3 5 [d, e, f]
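For instance (a sketch with a hypothetical helper tag_sublist, just to illustrate the columns-plus-constant pattern), np.vectorize broadcasts the scalar 'foo' to every row:
def tag_sublist(sta, end, tag):
    # tag arrives alongside the per-row column values
    return [tag] + mylist[sta:end+1]

df['col_4'] = np.vectorize(tag_sublist, otypes=["O"])(df['col_1'], df['col_2'], 'foo')
# 0       [foo, a, b]
# 1    [foo, c, d, e]
# 2    [foo, d, e, f]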
Here is a faster solution:
def func_1(a, b):
    return a + b
df["C"] = func_1(df["A"].to_numpy(),df["B"].to_numpy())
This is 380 times faster than df.apply(f, axis=1) from @Aman and 310 times faster than df['col_3'] = df.apply(lambda x: f(x.col_1, x.col_2), axis=1) from @ajrwhite.
I add some benchmarks too:
Results:
FUNCTIONS          TIMINGS   GAIN
apply lambda       0.7       x 1
apply              0.56      x 1.25
map                0.3       x 2.3
np.vectorize       0.01      x 70
f3 on Series       0.0026    x 270
f3 on np arrays    0.0018    x 380
f3 numba           0.0018    x 380
In short:
Using apply is slow. We can speed things up very simply, just by using a function that operates directly on Pandas Series (or better, on numpy arrays). Because we operate on Pandas Series or numpy arrays, the operations are vectorized. The function returns a Pandas Series or numpy array that we assign as a new column.
And here is the benchmark code:
import timeit
timeit_setup = """
import pandas as pd
import numpy as np
import numba
np.random.seed(0)
# Create a DataFrame of 10000 rows with 2 columns "A" and "B"
# containing integers between 0 and 9
df = pd.DataFrame(np.random.randint(0, 10, size=(10000, 2)), columns=["A", "B"])
def f1(a, b):
    # Here a and b are the values of columns A and B for a specific row: integers
    return a + b
def f2(x):
    # Here, x is a pandas Series corresponding to a specific row of the DataFrame;
    # 0 and 1 are the indexes of columns A and B
    return x[0] + x[1]
def f3(a, b):
    # Same as f1, but we will pass parameters that allow vectorization
    # Here, a and b will be Pandas Series or numpy arrays:
    # with df["C"] = f3(df["A"],df["B"]): Pandas Series
    # with df["C"] = f3(df["A"].to_numpy(),df["B"].to_numpy()): numpy arrays
    return a + b
@numba.njit('int64[:](int64[:], int64[:])')
def f3_numba_vectorize(a, b):
    # Here a and b are 2 numpy arrays with dtype int64;
    # this function must return a numpy array with dtype int64
    return a + b
"""
test_functions = [
'df["C"] = df.apply(lambda row: f1(row["A"], row["B"]), axis=1)',
'df["C"] = df.apply(f2, axis=1)',
'df["C"] = list(map(f3,df["A"],df["B"]))',
'df["C"] = np.vectorize(f3) (df["A"].to_numpy(),df["B"].to_numpy())',
'df["C"] = f3(df["A"],df["B"])',
'df["C"] = f3(df["A"].to_numpy(),df["B"].to_numpy())',
'df["C"] = f3_numba_vectorize(df["A"].to_numpy(),df["B"].to_numpy())'
]
for test_function in test_functions:
    print(min(timeit.repeat(setup=timeit_setup, stmt=test_function, repeat=7, number=10)))
Output:
0.7
0.56
0.3
0.01
0.0026
0.0018
0.0018
Final note: things could be optimized with Cython and other numba tricks too.
The way you have written f, it needs two inputs. If you look at the error message, it says you are not providing two inputs to f, just one. The error message is correct.
The mismatch is because df[['col1','col2']] returns a single dataframe with two columns, not two separate columns.
You need to change your f so that it takes a single input, keep the above data frame as that input, then break it up into x and y inside the function body. Then do whatever you need and return a single value.
You need this function signature because the syntax is .apply(f), so f needs to take the single thing = dataframe, not the two things your current f expects.
Since you haven't provided the body of f, I can't help in any more detail - but this should provide the way out without fundamentally changing your code or swapping apply for some other method.
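Since the body of f wasn't given, here is a minimal sketch of that rewrite (the body is a placeholder for my_function_expression):
def f(row):
    # row is one row of df[['col_1', 'col_2']], passed in by apply(axis=1)
    x, y = row['col_1'], row['col_2']   # break the single input up inside
    return x + y                        # placeholder body

df['col_3'] = df[['col_1', 'col_2']].apply(f, axis=1)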
Another option is df.itertuples() (generally faster and recommended over df.iterrows() by docs and user testing):
import pandas as pd
df = pd.DataFrame([range(4) for _ in range(4)], columns=list("abcd"))
df
a b c d
0 0 1 2 3
1 0 1 2 3
2 0 1 2 3
3 0 1 2 3
df["e"] = [sum(row) for row in df[["b", "d"]].itertuples(index=False)]
df
a b c d e
0 0 1 2 3 4
1 0 1 2 3 4
2 0 1 2 3 4
3 0 1 2 3 4
Since itertuples returns an Iterable of namedtuples, you can access tuple elements both as attributes by column name (aka dot notation) and by index:
b, d = row
b = row.b
d = row[1]
My example for your question:
def get_sublist(row, col1, col2):
    return mylist[row[col1]:row[col2]+1]
df.apply(get_sublist, axis=1, col1='col_1', col2='col_2')
It can be done in two simple ways.
Let's say we want the sum of col_1 and col_2 in an output column named col_sum.
Method 1
f = lambda x: x.col_1 + x.col_2
df['col_sum'] = df.apply(f, axis=1)
Method 2
def f(x):
    x['col_sum'] = x.col_1 + x.col_2
    return x
df = df.apply(f, axis=1)
Method 2 should be used when some complex function has to be applied to the dataframe. Method 2 can also be used when output in multiple columns is required.
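For instance (a sketch, with a hypothetical col_diff added alongside col_sum), returning the row Series yields one output column per assigned key:
def f(x):
    x['col_sum'] = x.col_1 + x.col_2
    x['col_diff'] = x.col_1 - x.col_2
    return x

df = df.apply(f, axis=1)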
I suppose you don't want to change the get_sublist function and just want to use the DataFrame's apply method to do the job. To get the result you want, I've written two helper functions: get_sublist_list and unlist. As the function names suggest, the first gets a list containing the sublist, and the second extracts that sublist from the list. Finally, we call apply to apply those two functions to the df[['col_1','col_2']] DataFrame in turn.
import pandas as pd
df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']
def get_sublist(sta, end):
    return mylist[sta:end+1]
def get_sublist_list(cols):
    return [get_sublist(cols[0], cols[1])]
def unlist(list_of_lists):
    return list_of_lists[0]
df['col_3'] = df[['col_1','col_2']].apply(get_sublist_list, axis=1).apply(unlist)
df
If you don't use [] to enclose the result of get_sublist, then get_sublist_list will return a plain list and apply will raise ValueError: could not broadcast input array from shape (3) into shape (2), as @Ted Petrou mentioned.
If you have a huge data-set, you can use an easy but faster (in execution time) way of doing this using swifter:
import pandas as pd
import swifter
def fnc(m, x, c):
    return m*x + c
df = pd.DataFrame({"m": [1,2,3,4,5,6], "c": [1,1,1,1,1,1], "x":[5,3,6,2,6,1]})
df["y"] = df.swifter.apply(lambda x: fnc(x.m, x.x, x.c), axis=1)

Pandas: remove duplicate rows when matches can be in switched columns [duplicate]

What is the proper way to go from this df:
>>> df=pd.DataFrame({'a':['jeff','bob','jill'], 'b':['bob','jeff','mike']})
>>> df
a b
0 jeff bob
1 bob jeff
2 jill mike
To this:
>>> df2
a b
0 jeff bob
2 jill mike
where you're dropping a duplicate row based on the items in 'a' and 'b', without regard to their specific column.
I can hack together a solution using a lambda expression to create a mask and then drop duplicates based on the mask column, but I'm thinking there has to be a simpler way than this:
>>> df['c'] = df[['a', 'b']].apply(lambda x: ''.join(sorted((x[0], x[1]), \
key=lambda x: x[0]) + sorted((x[0], x[1]), key=lambda x: x[1] )), axis=1)
>>> df.drop_duplicates(subset='c', keep='first', inplace=True)
>>> df = df.iloc[:,:-1]
I think you can sort each row independently and then use duplicated to see which ones to drop.
dupes = df.apply(lambda x: x.sort_values().values, axis=1).duplicated()
df[~dupes]
A faster way to get dupes. Thanks to @DSM.
dupes = df.T.apply(sorted).T.duplicated()
I think the simplest approach is to use apply with axis=1 to sort each row, then call DataFrame.duplicated:
df = df[~df.apply(sorted, 1).duplicated()]
print (df)
a b
0 jeff bob
2 jill mike
A bit more complicated, but very fast: use numpy.sort with the DataFrame constructor:
df1 = pd.DataFrame(np.sort(df.values, axis=1), index=df.index, columns=df.columns)
df = df[~df1.duplicated()]
print (df)
a b
0 jeff bob
2 jill mike
Timings:
np.random.seed(123)
N = 10000
df = pd.DataFrame({'A': np.random.randint(100, size=N).astype(str),
                   'B': np.random.randint(100, size=N).astype(str)})
#print (df)
In [63]: %timeit (df[~pd.DataFrame(np.sort(df.values, axis=1), index=df.index, columns=df.columns).duplicated()])
100 loops, best of 3: 3.25 ms per loop
In [64]: %timeit (df[~df.apply(sorted, 1).duplicated()])
1 loop, best of 3: 1.09 s per loop
# @Ted Petrou, solution 1
In [65]: %timeit (df[~df.apply(lambda x: x.sort_values().values, axis=1).duplicated()])
1 loop, best of 3: 2.89 s per loop
# @Ted Petrou, solution 2
In [66]: %timeit (df[~df.T.apply(sorted).T.duplicated()])
1 loop, best of 3: 1.56 s per loop
