Using Apply in Pandas Lambda functions with multiple if statements - python

I'm trying to infer a classification according to the size of a person in a dataframe like this one:
Size
1 80000
2 8000000
3 8000000000
...
I want it to look like this:
Size Classification
1 80000 <1m
2 8000000 1-10m
3 8000000000 >1bi
...
I understand that the ideal process would be to apply a lambda function like this:
df['Classification']=df['Size'].apply(lambda x: "<1m" if x<1000000 else "1-10m" if 1000000<x<10000000 else ...)
I checked a few posts regarding multiple ifs in a lambda function, here is an example link, but that synthax is not working for me for some reason in a multiple ifs statement, but it was working in a single if condition.
So I tried this "very elegant" solution:
df['Classification']=df['Size'].apply(lambda x: "<1m" if x<1000000 else pass)
df['Classification']=df['Size'].apply(lambda x: "1-10m" if 1000000 < x < 10000000 else pass)
df['Classification']=df['Size'].apply(lambda x: "10-50m" if 10000000 < x < 50000000 else pass)
df['Classification']=df['Size'].apply(lambda x: "50-100m" if 50000000 < x < 100000000 else pass)
df['Classification']=df['Size'].apply(lambda x: "100-500m" if 100000000 < x < 500000000 else pass)
df['Classification']=df['Size'].apply(lambda x: "500m-1bi" if 500000000 < x < 1000000000 else pass)
df['Classification']=df['Size'].apply(lambda x: ">1bi" if 1000000000 < x else pass)
Works out that "pass" seems not to apply to lambda functions as well:
df['Classification']=df['Size'].apply(lambda x: "<1m" if x<1000000 else pass)
SyntaxError: invalid syntax
Any suggestions on the correct synthax for a multiple if statement inside a lambda function in an apply method in Pandas? Either multi-line or single line solutions work for me.

Here is a small example that you can build upon:
Basically, lambda x: x.. is the short one-liner of a function. What apply really asks for is a function which you can easily recreate yourself.
import pandas as pd
# Recreate the dataframe
data = dict(Size=[80000,8000000,800000000])
df = pd.DataFrame(data)
# Create a function that returns desired values
# You only need to check upper bound as the next elif-statement will catch the value
def func(x):
if x < 1e6:
return "<1m"
elif x < 1e7:
return "1-10m"
elif x < 5e7:
return "10-50m"
else:
return 'N/A'
# Add elif statements....
df['Classification'] = df['Size'].apply(func)
print(df)
Returns:
Size Classification
0 80000 <1m
1 8000000 1-10m
2 800000000 N/A

You can use pd.cut function:
bins = [0, 1000000, 10000000, 50000000, ...]
labels = ['<1m','1-10m','10-50m', ...]
df['Classification'] = pd.cut(df['Size'], bins=bins, labels=labels)

The apply lambda function actually does the job here, I just wonder what the problem was.... as your syntax looks ok and it works....
df1= [80000, 8000000, 8000000000, 800000000000]
df=pd.DataFrame(df1)
df.columns=['size']
df['Classification']=df['size'].apply(lambda x: '<1m' if x<1000000 else '1-10m' if 1000000<x<10000000 else '1bi')
df
Output:

Using Numpy's searchsorted
labels = np.array(['<1m', '1-10m', '10-50m', '>50m'])
bins = np.array([1E6, 1E7, 5E7])
# Using assign is my preference as it produces a copy of df with new column
df.assign(Classification=labels[bins.searchsorted(df['Size'].values)])
If you wanted to produce new column in existing dataframe
df['Classification'] = labels[bins.searchsorted(df['Size'].values)]
Some Explanation
From Docs:np.searchsorted
Find indices where elements should be inserted to maintain order.
Find the indices into a sorted array a such that, if the corresponding elements in v were inserted before the indices, the order of a would be preserved.
The labels array has a length greater than that of bins by one. Because when something is greater than the maximum value in bins, searchsorted returns a -1. When we slice labels this grabs the last label.

Related

How would you optimize this short but very slow Python loop?

I'm switching from R to Python. Unfortunately, I found that while some structures run almost instantly in R, they take some seconds (and even minutes) in Python. Upon reading I found for loops are strongly discouraged in pandas, and other alternatives such as vectorization and apply are recommended.
In this sample code: From a column of values that are sorted from min to max, keep all the values that come first after a gap of length '200'.
import numpy as np
import pandas as pd
#Let's create the sample data. It consists of a column with random sorted values, and an extra True/False column, where we will flag the values we want
series = np.random.uniform(1,1000000,100000)
test = [True]*100000
data = pd.DataFrame({'series' : series, 'test':test })
data.sort_values(by=['series'], inplace=True)
#Loop to get rid of the next values that fall within the '200' threshold after the first next valid value
for i in data['series']:
if data.loc[data['series'] == i,'test'].item() == True:
data.loc[(data['series'] > i) & (data['series'] <= i+200 ) ,'test' ] = False
#Finally, let's keep the first values after any'200' threshold
data = data.loc[data['test']==True , 'series']
Is it possible to turn this into a function, vectorize, apply, or any other structure other than 'for' loop to make it run almost instantly?
This is my approach with a while loop:
head = 0
indexes = []
while head < len(data):
thresh = data['series'].iloc[head] + 200
indexes.append(head)
head += 1
while head < len(data) and data['series'].iloc[head] < thresh:
head+=1
# output:
data = data.iloc[indexes]
# double check with your approach
set(data.loc[data['test']].index) == set(data.iloc[indexes].index)
# output: True
The above took 984ms while your approach took 56s.
You can do it with a simple, one-pass algorithm using one loop over the series; no need for vectorisation or anything like that. It takes 33 milliseconds on my machine, so not "instantaneous", but blink and you'll miss it.
def first_after_gap(series, gap=200):
out = []
last = float('-inf')
for x in series:
if x - last >= gap:
out.append(x)
last = x
return out
Example:
>>> import numpy as np
>>> series = sorted(np.random.uniform(1, 1000000, 100000))
>>> from timeit import timeit
>>> timeit(lambda: first_after_gap(series), number=1)
0.03264855599991279
searchsorted
You can find the next one without looping over all... sort of.
This should be quicker.
As pointed out in the comments, quicker depends on the data.
Note that I use a similar approach as Quang because they are correct, you have to loop. The difference is that I use searchsorted to find the next position at each position rather than looping over each position and evaluating whether I should add that position.
a = data.series.to_numpy()
head = 0
indexes = [head]
while head < len(data):
head = a[head:].searchsorted(a[head] + 200) + head
if -1 < head < len(data):
indexes.append(head)
data.iloc[indexes]
series test
77193 5.663829 True
36166 210.829727 True
85730 413.206840 True
68686 613.849315 True
88026 819.096379 True
... ... ...
13863 999074.688286 True
31992 999276.058929 True
71844 999487.746496 True
84515 999690.104536 True
6029 999891.101087 True
[4761 rows x 2 columns]

Pandas agg how to count rows where a condition is true

I am using lambda function and agg() in python to perform some function on each element of the dataframe.
I have following cases
lambda x: (x==0).sum() - Question: Does this logically compute (x==0) as 1, if true, and 0, if false and then adds all ones and zeros? or is it doing something else?
lambda x: x.sum() - Question: This is apparent, but still I'll ask. This adds all the elements or x passed to it. Is this correct?
(x == 0).sum() counts the number of rows where the condition x == 0 is true. x.sum() just computes the "sum" of x (the actual result depends on the type).

List interval comparison

I am trying to make a very long program much shorter by making it concise, because I need to modify it to run through several kinds of reports. Basically, it loads a list from a report in excel, and then checks to see if those values are above or below control limits. I tried using an interval comparison to see if any value in my list was not between the control limits, but that did not work. Instead, I had to go with a little bit longer method that did work. Can someone please explain to me why the second method shown below does not work? There are no errors, but it does not find the failed tests like the first one does.
############### This is the same between the two methods #############
#Loading my list with the variables to be checked
GtimeList = [37, 37, 37, 32, 32, 32,
Gtime3b, GtimeAveb]
GT = 0
#Make sure these are numbers
if any(isinstance(x, str) for x in GtimeList):
continue
######## Method one works fine, but I want it more concise ############
#Check to see if any of the variables are not between 10 to 35
elif any(10 > x for x in GtimeList) or any(35 < x for x in GtimeList):
GT = 'Gel Time'
######## Method two, this is how I want it to work ########
#Check to see if any of the variables are not between 10 to 35
elif any(10 > x > 35 for x in GtimeList):
GT = 'Gel Time'
What you are looking for is maybe this:
any(x not in range(10,36) for x in GtimeList)
This is sort of more of a logic question than a programming question. Both of your code snippets have two conditions for each value, for a total of 2n conditions. Your first code snippet just needs one out of those 2n conditions to be true. Your second requires two of them to be true, and needs the two to be for the same value. You should replace the any in the second code with not all.
Basically, your first code is "∃ x: 10 > x and ∃ x: 35 < x", while your second is ∃x: (10 > x and x < 35). You're turning "or" into "and". Using logic rules, we can do the following:
∃ x: 10 > x or ∃ x: 35 < x ≡
not (∀ x: 10 > x) or not (∀ x: 35 < x) ≡
not ((∀ x: 10 > x) and (∀ x: 35 < x)) ≡
not ((∀ x: 10 > x and 35 < x))
You could also do min(GtimeList) < 10 or max(GtimeList) > 35.
And as a side note regarding your isinstance(x, str) check, it's generally a better idea to check whether everything is what you want it to be, rather than it isn't what you don't want it to be. What if x is something other than a string or a number, such as a list?

Implementing if-else in python dataframe using lambda when there are multiple variables

I am trying to implement if-elif or if-else logic in python while working on a dataframe. I am struggling when working with more than one column.
sample data frame
df=pd.DataFrame({"one":[1,2,3,4,5],"two":[6,7,8,9,10], "name": 'a', 'b', 'a', 'b', 'c'})
If my if-else logic is based on only one column - I know how to do it.
df['one'] = df["one"].apply(lambda x: x*10 if x<2 else (x**2 if x<4 else x+10))
But I want to modify column 'one' based on values of column 'two' - and I feel its going be something like this -
lambda x, y: x*100 if y>8 else (x*1 if y<8 else x**2)
But I am not sure how to specify the second column. I tried this way but obviously that's incorrect
df['one'] = df["one"]["two"].apply(lambda x, y: x*100 if y>8 else (x*1 if y<8 else x**2))
Question 1 - what'd be the correct syntax for the above code ?
Question 2 - How to implement below logic using lambda ?
if df['name'].isin(['a','b']) df['one'] = 100 else df['one'] = df['two']
If I write something like x.isin(['a','b']) it won't work.
Apply across columns
Use pd.DataFrame.apply instead of pd.Series.apply and specify axis=1:
df['one'] = df.apply(lambda row: row['one']*100 if row['two']>8 else \
(row['one']*1 if row['two']<8 else row['one']**2), axis=1)
Unreadable? Yes, I agree. Let's try again but this time rewrite as a named function.
Using a function
Note lambda is just an anonymous function. We can define a function explicitly and use it with pd.DataFrame.apply:
def calc(row):
if row['two'] > 8:
return row['one'] * 100
elif row['two'] < 8:
return row['one']
else:
return row['one']**2
df['one'] = df.apply(calc, axis=1)
Readable? Yes. But this isn't vectorised. We're looping through each row one at at at time. We might as well have used a list. Pandas isn't just for clever table formatting, you can use it for vectorised calculations using arrays in contiguous memory blocks. So let's try one more time.
Vectorised calculations
Using numpy.where:
df['one'] = np.where(row['two'] > 8, row['one'] * 100,
np.where(row['two'] < 8, row['one'],
row['one']**2))
There we go. Readable and efficient. We have effectively vectorised our if / else statements. Does this mean that we are doing more calculations than necessary? Yes! But this is more than offset by the way in which we are performing the calculations, i.e. with well-defined blocks of memory rather than pointers. You will find an order of magnitude performance improvement.
Another example
Well, we can just use numpy.where again.
df['one'] = np.where(df['name'].isin(['a', 'b']), 100, df['two'])
you can do
df.apply(lambda x: x["one"] + x["two"], axis=1)
but i don't think that such a long lambda as lambda x: x["one"]*100 if x["two"]>8 else (x["one"]*1 if x["two"]<8 else x["one"]**2) is very pythonic. apply takes any callback:
def my_callback(x):
if x["two"] > 8:
return x["one"]*100
elif x["two"] < 8:
return x["one"]
else:
return x["one"]**2
df.apply(my_callback, axis=1)

What is this set of all sets one line function doing?

I found this one line function on the python wiki that creates a set of all sets that can be created from a list passed as an argument.
f = lambda x: [[y for j, y in enumerate(set(x)) if (i >> j) & 1] for i in range(2**len(set(x)))]
Can someone please explain how this function works?
To construct the powerset, iterating over 2**len(set(x)) gives you all the binary combinations of the set.
range(2**len(set(x))) == [00000, 00001, 00010, ..., 11110, 11111]
Now you just need to test if the bit is set in i to see if you need to include it in the set, e.g.:
>>> i = 0b10010
>>> [y for j, y in enumerate(range(5)) if (i >> j) & 1]
[1, 4]
Though I'm not sure how efficient it is given the call to set(x) for every iteration. There is a small hack that would avoid that:
f = lambda x: [[y for j, y in enumerate(s) if (i >> j) & 1] for s in [set(x)] for i in range(2**len(s))]
A couple of other forms using itertools:
import itertools as it
f1 = lambda x: [list(it.compress(s, i)) for s in [set(x)] for i in it.product((0,1), repeat=len(s))]
f2 = lambda x: list(it.chain.from_iterable(it.combinations(set(x), r) for r in range(len(set(x))+1)))
Note: this last one could just return an iterable vs list if you remove list() depending on the use-case this could save some memory.
Looking at some timings of a list of 25 random numbers 0-50:
%%timeit binary: 1 loop, best of 3: 20.1 s per loop
%%timeit binary+hack: 1 loop, best of 3: 17.9 s per loop
%%timeit compress/product: 1 loop, best of 3: 5.27 s per loop
%%timeit chain/combinations: 1 loop, best of 3: 659 ms per loop
Let's rewrite it a bit and break it down step by step:
f = lambda x: [[y for j, y in enumerate(set(x)) if (i >> j) & 1] for i in range(2**len(set(x)))]
is equivalent to:
def f(x):
n = len(set(x))
sets = []
for i in range(n): # all combinations of members of the set in binary
set_i = []
for j, y in enumerate(set(x)):
if (i>>j) & 1: #check if bit nr j is set
set_x.append(y)
sets.append(set_i)
return sets
for an input list like [1,2,3,4], the following happens:
n=4
range(2**n)=[0,1,2,3...15]
which, in binary is:
0,1,10,11,100...1110,1111
Enumerate makes tuples of y with its index, so in our case:
[(0,1),(1,2),(2,3),(3,4)]
The (i>>j) & 1 part might require some explanation.
(i>>j) shifts the number i j places to the right, e.g. in decimal: 4>>2=1, or in binary:100>>2=001. The & is the bit-wise and operator. This checks, for every bit of both operands, if they are 1 and returns the result as a number, acting like a filter: 10111 & 11001 = 10101.
In the case of our example, it checks if the bit at place j is 1. If it is, the corresponding value is added to the result list. This way the binary map of combinations is converted to a list of lists, which is returned.

Categories