Pandas agg how to count rows where a condition is true - python

I am using a lambda function and agg() in Python to apply a function to each element of the DataFrame.
I have the following cases:
lambda x: (x==0).sum() - Question: Does this logically compute (x==0) as 1 if true and 0 if false, and then add up all the ones and zeros? Or is it doing something else?
lambda x: x.sum() - Question: This seems apparent, but I'll still ask. This adds all the elements of x passed to it. Is this correct?

(x == 0) produces a boolean Series, and summing it treats True as 1 and False as 0, so (x == 0).sum() counts the number of rows where the condition x == 0 is true. x.sum() just computes the "sum" of x (the actual result depends on the type).
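As a minimal sketch of both cases, assuming a small made-up DataFrame (the column names and values are illustrative only, not from the question):

```python
import pandas as pd

# Hypothetical data, chosen only to illustrate the two lambdas
df = pd.DataFrame({"a": [0, 1, 0, 2], "b": [0, 0, 0, 5]})

# Each lambda receives one column (a Series) at a time
zero_counts = df.agg(lambda x: (x == 0).sum())  # number of zeros per column
col_sums = df.agg(lambda x: x.sum())            # plain sum per column

print(zero_counts)
print(col_sums)
```

Here column "a" contains two zeros and column "b" three, so those are the counts the first lambda returns.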

Related

Implementing if-else in python dataframe using lambda when there are multiple variables

I am trying to implement if-elif or if-else logic in python while working on a dataframe. I am struggling when working with more than one column.
sample data frame
df=pd.DataFrame({"one":[1,2,3,4,5],"two":[6,7,8,9,10], "name": ['a', 'b', 'a', 'b', 'c']})
If my if-else logic is based on only one column - I know how to do it.
df['one'] = df["one"].apply(lambda x: x*10 if x<2 else (x**2 if x<4 else x+10))
But I want to modify column 'one' based on values of column 'two', and I feel it's going to be something like this -
lambda x, y: x*100 if y>8 else (x*1 if y<8 else x**2)
But I am not sure how to specify the second column. I tried this way but obviously that's incorrect
df['one'] = df["one"]["two"].apply(lambda x, y: x*100 if y>8 else (x*1 if y<8 else x**2))
Question 1 - what'd be the correct syntax for the above code ?
Question 2 - How to implement below logic using lambda ?
if df['name'].isin(['a','b']) df['one'] = 100 else df['one'] = df['two']
If I write something like x.isin(['a','b']) it won't work.
Apply across columns
Use pd.DataFrame.apply instead of pd.Series.apply and specify axis=1:
df['one'] = df.apply(lambda row: row['one']*100 if row['two']>8 else \
(row['one']*1 if row['two']<8 else row['one']**2), axis=1)
Unreadable? Yes, I agree. Let's try again but this time rewrite as a named function.
Using a function
Note lambda is just an anonymous function. We can define a function explicitly and use it with pd.DataFrame.apply:
def calc(row):
    if row['two'] > 8:
        return row['one'] * 100
    elif row['two'] < 8:
        return row['one']
    else:
        return row['one']**2

df['one'] = df.apply(calc, axis=1)
Readable? Yes. But this isn't vectorised: we're looping through the rows one at a time, and we might as well have used a list. Pandas isn't just for clever table formatting; you can use it for vectorised calculations on arrays held in contiguous memory blocks. So let's try one more time.
Vectorised calculations
Using numpy.where:
df['one'] = np.where(df['two'] > 8, df['one'] * 100,
                     np.where(df['two'] < 8, df['one'],
                              df['one']**2))
There we go. Readable and efficient. We have effectively vectorised our if / else statements. Does this mean that we are doing more calculations than necessary? Yes! But this is more than offset by the way in which the calculations are performed, i.e. on well-defined blocks of memory rather than via pointers to individual Python objects. On large DataFrames you will typically find an order-of-magnitude performance improvement.
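For anyone who wants to check that the row-wise and vectorised versions agree, here is a small self-contained sketch on the sample frame from the question. The variable names row_wise and vectorised are just for illustration, and both results are computed from the untouched 'one' column rather than overwriting it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3, 4, 5],
                   "two": [6, 7, 8, 9, 10],
                   "name": ['a', 'b', 'a', 'b', 'c']})

def calc(row):
    if row['two'] > 8:
        return row['one'] * 100
    elif row['two'] < 8:
        return row['one']
    else:
        return row['one']**2

# Row-by-row version: a Python-level loop over the rows
row_wise = df.apply(calc, axis=1)

# Vectorised version: nested np.where evaluates whole columns at once
vectorised = np.where(df['two'] > 8, df['one'] * 100,
                      np.where(df['two'] < 8, df['one'],
                               df['one']**2))

print(row_wise.tolist())
print(vectorised.tolist())
```

Both should give the same values; only the way they are computed differs.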
Another example
Well, we can just use numpy.where again.
df['one'] = np.where(df['name'].isin(['a', 'b']), 100, df['two'])
You can do
df.apply(lambda x: x["one"] + x["two"], axis=1)
but I don't think a lambda as long as lambda x: x["one"]*100 if x["two"]>8 else (x["one"]*1 if x["two"]<8 else x["one"]**2) is very Pythonic. apply takes any callable:
def my_callback(x):
    if x["two"] > 8:
        return x["one"]*100
    elif x["two"] < 8:
        return x["one"]
    else:
        return x["one"]**2

df.apply(my_callback, axis=1)

Using Apply in Pandas Lambda functions with multiple if statements

I'm trying to infer a classification according to the size of a person in a dataframe like this one:
Size
1 80000
2 8000000
3 8000000000
...
I want it to look like this:
Size Classification
1 80000 <1m
2 8000000 1-10m
3 8000000000 >1bi
...
I understand that the ideal process would be to apply a lambda function like this:
df['Classification']=df['Size'].apply(lambda x: "<1m" if x<1000000 else "1-10m" if 1000000<x<10000000 else ...)
I checked a few posts regarding multiple ifs in a lambda function (here is an example link), but for some reason that syntax is not working for me with multiple if statements, although it worked with a single if condition.
So I tried this "very elegant" solution:
df['Classification']=df['Size'].apply(lambda x: "<1m" if x<1000000 else pass)
df['Classification']=df['Size'].apply(lambda x: "1-10m" if 1000000 < x < 10000000 else pass)
df['Classification']=df['Size'].apply(lambda x: "10-50m" if 10000000 < x < 50000000 else pass)
df['Classification']=df['Size'].apply(lambda x: "50-100m" if 50000000 < x < 100000000 else pass)
df['Classification']=df['Size'].apply(lambda x: "100-500m" if 100000000 < x < 500000000 else pass)
df['Classification']=df['Size'].apply(lambda x: "500m-1bi" if 500000000 < x < 1000000000 else pass)
df['Classification']=df['Size'].apply(lambda x: ">1bi" if 1000000000 < x else pass)
It turns out that "pass" cannot be used in a lambda function either:
df['Classification']=df['Size'].apply(lambda x: "<1m" if x<1000000 else pass)
SyntaxError: invalid syntax
Any suggestions on the correct syntax for multiple if statements inside a lambda function in an apply method in Pandas? Either multi-line or single-line solutions work for me.
Here is a small example that you can build upon:
Basically, lambda x: x.. is the short one-liner of a function. What apply really asks for is a function which you can easily recreate yourself.
import pandas as pd

# Recreate the dataframe
data = dict(Size=[80000,8000000,800000000])
df = pd.DataFrame(data)

# Create a function that returns desired values
# You only need to check upper bound as the next elif-statement will catch the value
def func(x):
    if x < 1e6:
        return "<1m"
    elif x < 1e7:
        return "1-10m"
    elif x < 5e7:
        return "10-50m"
    else:
        return 'N/A'
    # Add elif statements....

df['Classification'] = df['Size'].apply(func)
print(df)
Returns:
Size Classification
0 80000 <1m
1 8000000 1-10m
2 800000000 N/A
You can use pd.cut function:
bins = [0, 1000000, 10000000, 50000000, ...]
labels = ['<1m','1-10m','10-50m', ...]
df['Classification'] = pd.cut(df['Size'], bins=bins, labels=labels)
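A runnable sketch of the pd.cut approach might look like the following. The full list of bin edges and labels is taken from the thresholds in the question, not from this answer, and two choices are my own assumptions: np.inf as the open upper edge, and right=False so that each bin includes its lower bound and excludes its upper bound (so that "<1m" really means below one million).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Size": [80000, 8000000, 8000000000]})

# Bin edges and labels assumed from the question's thresholds
bins = [0, 1e6, 1e7, 5e7, 1e8, 5e8, 1e9, np.inf]
labels = ['<1m', '1-10m', '10-50m', '50-100m', '100-500m', '500m-1bi', '>1bi']

# right=False makes the intervals [lower, upper) instead of (lower, upper]
df['Classification'] = pd.cut(df['Size'], bins=bins, labels=labels, right=False)
print(df)
```

Note that the result is a categorical column; call .astype(str) on it if plain strings are needed.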
The apply lambda function actually does the job here, I just wonder what the problem was.... as your syntax looks ok and it works....
df1= [80000, 8000000, 8000000000, 800000000000]
df=pd.DataFrame(df1)
df.columns=['size']
df['Classification']=df['size'].apply(lambda x: '<1m' if x<1000000 else '1-10m' if 1000000<x<10000000 else '1bi')
df
Output:
Using Numpy's searchsorted
labels = np.array(['<1m', '1-10m', '10-50m', '>50m'])
bins = np.array([1E6, 1E7, 5E7])
# Using assign is my preference as it produces a copy of df with new column
df.assign(Classification=labels[bins.searchsorted(df['Size'].values)])
If you wanted to produce new column in existing dataframe
df['Classification'] = labels[bins.searchsorted(df['Size'].values)]
Some Explanation
From Docs:np.searchsorted
Find indices where elements should be inserted to maintain order.
Find the indices into a sorted array a such that, if the corresponding elements in v were inserted before the indices, the order of a would be preserved.
The labels array is one element longer than bins. This is because when a value is greater than the maximum value in bins, searchsorted returns len(bins), which is the index of the last element of labels. Indexing labels with that result therefore grabs the last label.
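To make the indexing concrete, here is a small sketch using the labels and bins from this answer together with the three sizes from the question. The intermediate name positions is only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Size": [80000, 8000000, 8000000000]})

labels = np.array(['<1m', '1-10m', '10-50m', '>50m'])
bins = np.array([1E6, 1E7, 5E7])

# For each size, the index at which it would be inserted into bins
positions = bins.searchsorted(df['Size'].values)
print(positions)

# A value above every bin edge gets index len(bins), i.e. the last label
classified = df.assign(Classification=labels[positions])
print(classified)
```

The largest size exceeds every edge in bins, so its position is 3 and it picks up the last label.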

Check for a condition within an array

I would like to check a condition inside an array and perform an operation on the position where the condition is met. For example, this piece of code does the job:
res = somefunction(x)
for i in range(x.shape[0]):
for j in range(x.shape[1]):
if not 6 < res[i,j] < 18:
x[i,j] = float('nan')
But I thought a faster (and shorter) way would maybe be something like this:
x[not 6 < somefunction(x) < 18] = float('nan')
But Python gives an error saying that condition checking doesn't work on an array with more than one element. Is there a way to make my code go faster?
You can't use not or chained comparisons with arrays, since neither not nor chained comparisons can be implemented to broadcast.
Split the chained comparison into two comparisons, and use ~ and & instead of not and and, since NumPy uses the bitwise operators for boolean operations on boolean arrays:
x[~((6 < res) & (res < 18))] = numpy.nan
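A self-contained sketch of the mask, assuming a small made-up array and using x * 4 as a stand-in for somefunction(x), which is not shown in the question:

```python
import numpy as np

x = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
res = x * 4  # hypothetical stand-in for somefunction(x)

# True wherever res is NOT strictly between 6 and 18
mask = ~((6 < res) & (res < 18))
x[mask] = np.nan
print(x)
```

The parentheses around each comparison matter, because & binds more tightly than < in Python.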

Evaluating a function using numpy

What is the significance of the return part when evaluating functions? Why is this necessary?
Your assumption is right: dfdx[0] is indeed the first value in that array, so according to your code that would correspond to evaluating the derivative at x=-1.0.
To know the correct index where x is equal to 0, you will have to look for it in the x array.
One way to find this is the following, where we find the index at which |x-0| is minimal (essentially where x=0, but floating-point arithmetic requires some precautions) using argmin:
index0 = np.argmin(np.abs(x-0))
We then get what we want, dfdx at the index where x is 0:
print(dfdx[index0])
Another way, less robust with respect to floating-point arithmetic, is the following:
# we make a boolean array that is True where x is zero and False everywhere else
bool_array = (x==0)
# NumPy allows a boolean array to be used as an index into another array
# Doing so will get you all the values of dfdx where bool_array is True
# In our case that will hopefully give us dfdx where x=0
print(dfdx[bool_array])
# same thing as a one-liner
print(dfdx[x==0])
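As a rough illustration of the lookup, here is a sketch on a made-up grid. The array dfdx below is a hypothetical stand-in (values of 2*x + 3), not the asker's derivative, and np.isclose is shown as a tolerance-based alternative to the exact == comparison.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)
dfdx = 2 * x + 3  # hypothetical stand-in for the asker's dfdx array

# Index of the grid point closest to zero
index0 = np.argmin(np.abs(x - 0))
value_at_zero = dfdx[index0]
print(index0, value_at_zero)

# Tolerance-based mask, less fragile than x == 0
near_zero = np.isclose(x, 0.0)
print(dfdx[near_zero])
```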
You have given the answer yourself: x[0] is -1.0, and you want the value at the middle of the array. np.linspace is a good function for building such a series of values:
import math
import numpy as np

def f1(x):
    g = np.sin(math.pi*np.exp(-x))
    return g

n = 1001 # odd !
x = np.linspace(-1, 1, n) # x[n//2] is 0
f1x = f1(x)
df1 = np.diff(f1x, 1)
dx = np.diff(x)
# analytical derivative; the slice covers the whole product to discard the last element
df1dx = (-math.pi*np.exp(-x)*np.cos(math.pi*np.exp(-x)))[:-1]
# In [3]: np.allclose(df1/dx,df1dx,atol=dx[0])
# Out[3]: True
As another tip, numpy arrays are used more efficiently and readably without loops.

Python lambda function

What is happening here?
reduce(lambda x,y: x+y, [x for x in range(1,1000) if x % 3 == 0 or x % 5 == 0])
I understand how x is iterating through all of the numbers from 1 to 999 and taking out those that are divisible by 3 or 5, but the 'lambda x,y: x+y' part is stumping me.
This is bad Python for
sum(x for x in range(1,1000) if x % 3 == 0 or x % 5 == 0)
It simply sums all numbers in the range 1..999 divisible by 3 or 5.
reduce() applies the given function to the first two items of the iterable, then to the result and the next item of the iterable, and so on. In this example, the function
lambda x, y: x + y
simply adds its operands.
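A short sketch of that behaviour, written for Python 3, where reduce lives in functools (the question's code relies on the Python 2 built-in). The variable names are just for illustration.

```python
from functools import reduce

# Small list first, to make the left-to-right folding visible:
# ((3 + 5) + 6) + 9
small = reduce(lambda x, y: x + y, [3, 5, 6, 9])
print(small)

# The full computation from the question, next to the built-in sum
numbers = [x for x in range(1, 1000) if x % 3 == 0 or x % 5 == 0]
total_reduce = reduce(lambda x, y: x + y, numbers)
total_sum = sum(numbers)
print(total_reduce, total_sum)
```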
Saying
f = lambda x, y : x + y
is almost the same as saying
def f(x, y):
    return x + y
In other words, lambda returns a function that, given the parameters to the left of the : sign, will return the value of the expression on the right of it.
Compared with a regular function, however, a lambda is quite limited: it allows only a single expression, and no statements. This is not a serious problem, because in Python you can define a full function even in the middle of another function and pass that instead.
The usage you have shown is quite bad, however, because the lambda there is pointless... Python lets you compute that sum directly instead of using reduce.
Also, by the way, for the result of that computation there is an easy closed-form solution that doesn't require any iteration at all... (hint: the sum of all numbers from 1 to n is n*(n+1)/2 and the sum of all multiples of 5 from 5 to n is 5*(sum of all numbers from 1 to n/5) ... therefore ...)
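One way the hint could be followed through is sketched below. The helper name sum_of_multiples is made up for illustration, and subtracting the multiples of 15 is my own reading of where the hint leads, since numbers divisible by both 3 and 5 would otherwise be counted twice.

```python
def sum_of_multiples(k, limit):
    """Sum of the positive multiples of k that are strictly below limit."""
    m = (limit - 1) // k          # how many multiples of k lie below limit
    return k * m * (m + 1) // 2   # k * (1 + 2 + ... + m)

# Multiples of 3, plus multiples of 5, minus multiples of 15 (counted twice)
closed_form = (sum_of_multiples(3, 1000)
               + sum_of_multiples(5, 1000)
               - sum_of_multiples(15, 1000))

# Brute-force check against the generator expression from the answer above
brute_force = sum(x for x in range(1, 1000) if x % 3 == 0 or x % 5 == 0)
print(closed_form, brute_force)
```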
A lambda designates an anonymous function. The syntax lambda x,y: x+y can be stated in English as "declare a nameless function taking two parameters named x and y. Perform the operation x+y. The return value of this nameless function will be the result of this operation."
reduce applies some function sequentially to the first two elements of a supplied list, then to the result of that function and the third element, and so on. Therefore, the lambda in the supplied code is used by reduce to add together the elements of the supplied list, which will contain all of the multiples of 3 and 5 less than 1000.
