How to map a function using multiple boolean columns in pandas? - python

I have a database of New York apartments with thousands of rented units. There are two columns, 'dog_allowed' and 'cat_allowed', that hold a 0 or 1 depending on whether that pet is allowed. What I'm trying to do is create another column, 'pet_level', based on them:
0 if no pets are allowed
1 if cat_allowed
2 if dog_allowed
3 if both are allowed
My initial approach at solving this was as follows:
df['pet_level'] = df.apply(lambda x: plev(0 = x[x['dog_allowed'] == 0 & x['cat_allowed'] == 0], 1 = x[x['cat_allowed'] == 1], 2 = x[x['dog_allowed'] == 1], 3 = x[x['dog_allowed'] == 1 & x['cat_allowed'] == 1]))
I tried this because I've handled smaller test datasets in a similar manner, but a lambda with the apply method doesn't seem to allow for it (the keyword syntax above isn't even valid Python).

The approach that currently works is to define a function with the needed conditional statements:
def plvl(db):
    if db['cat_allowed'] == 0 and db['dog_allowed'] == 0:
        val = 0
    elif db['cat_allowed'] == 1 and db['dog_allowed'] == 0:
        val = 1
    elif db['cat_allowed'] == 0 and db['dog_allowed'] == 1:
        val = 2
    else:  # both allowed
        val = 3
    return val
Then apply that function across the rows (axis=1) to create the desired column:
df['pet_level'] = df.apply(plvl, axis=1)
I'm not sure this is the most performance-efficient approach, but it works for testing purposes. I'm sure there are more Pythonic approaches that would be less demanding and equally helpful to know.

Instead of mapping, you can vectorize the operation like this:
df['pet_level'] = df['cat_allowed'] * 1 + df['dog_allowed'] * 2
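As a quick sanity check (a toy example, not from the original post), the arithmetic reproduces the requested mapping for all four combinations:

import pandas as pd

# toy data covering all four cat/dog combinations
df = pd.DataFrame({'cat_allowed': [0, 1, 0, 1],
                   'dog_allowed': [0, 0, 1, 1]})
df['pet_level'] = df['cat_allowed'] * 1 + df['dog_allowed'] * 2
print(df['pet_level'].tolist())  # [0, 1, 2, 3]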

Related

How to find out binary number similarity

Recently I appeared in a coding challenge where one of the questions was as follows:
There are N cars, each having some of M features. A car's feature list is given as a binary string,
for example: 0000111
where 0 means the feature is not supported.
Two cars are similar if their feature descriptions differ by at most one feature; for example, 11001 and 11000 are similar. For every car, find the count of cars in the list it is similar to.
I tried to solve the problem with the XOR operator, but it only worked for a few test cases.
cars = ["100", "110", "010", "011", "100"]
Here the car at index 0 is similar to the cars at indices 1 and 4, so its output should be 2. The same count needs to be found for every index.
Solution I tried:
def solution(cars):
    ret_list = []
    for i in range(len(cars)):
        count = 0
        for j in range(len(cars)):
            if i != j:
                if (int(cars[i]) ^ int(cars[j])) <= 100:
                    count += 1
        ret_list.append(count)
    print(ret_list)
    return ret_list
Output: [2, 3, 2, 1, 2]
But this doesn't work when the input is like:
cars = ["1000", "0110", "0010", "0101", "1010"]
Can someone please suggest a better solution that works for all kinds of binary numbers?
Try this: two cars are similar exactly when their XOR has at most one set bit, i.e. when it is 0 or a power of two, which x & (x - 1) == 0 tests:
def isPowerOf2(x):
    # note: also true for x == 0, which covers identical feature strings
    return (x & (x - 1)) == 0

def areSimilar(x, y):
    return isPowerOf2(x ^ y)

def solution(cars):
    similarPairCount = 0
    for i in range(len(cars) - 1):
        for j in range(i + 1, len(cars)):
            # parse the strings as base-2 integers, not base-10
            if areSimilar(int(cars[i], 2), int(cars[j], 2)):
                similarPairCount += 1
    return similarPairCount
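For example, with the sample input from the question (my own quick check, not part of the original answer):

print(solution(["100", "110", "010", "011", "100"]))  # 5

Note this counts unordered similar pairs; the per-index counts from the question, [2, 3, 2, 1, 2], sum to exactly twice this value.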
import numpy as np
# ...
# conversion to int (base 2)
cars_int = np.array([int(c, 2) for c in cars]).reshape(1, -1)
# matrix comparing all against all
compare = cars_int ^ cars_int.T  # bitwise XOR
# count of car pairs with Hamming distance 0 or 1
# (not including the comparison of each car with itself)
result = (np.sum(compare & (compare - 1) == 0) - cars_int.size) // 2
If you want a list with the similarity count per car:
# ...
result_per_car = np.sum(compare & (compare - 1) == 0, axis=1) - 1
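Putting the pieces together with the question's sample input (a quick self-contained check, filling the # ... placeholders with the question's data):

import numpy as np

cars = ["100", "110", "010", "011", "100"]
cars_int = np.array([int(c, 2) for c in cars]).reshape(1, -1)
compare = cars_int ^ cars_int.T
result_per_car = np.sum(compare & (compare - 1) == 0, axis=1) - 1
print(result_per_car.tolist())  # [2, 3, 2, 1, 2]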

Creating a Pandas dataframe column which is conditional on a function

Say I have a dataframe like the one below, and I create a new column (track_len) which gives the length of each track_no string:
import pandas as pd
df = pd.DataFrame({'item_id': [1,2,3], 'track_no': ['qwerty23', 'poiu2', 'poiuyt5']})
df['track_len'] = df['track_no'].str.len()
df.head()
My Question is:
How do I now create a new column (new_col) which selects a specific slice of the track_no string, with the slice depending on the length of the track number (track_len)?
I have tried creating a function which outputs the specific string slice of track_no for the various track_len conditions, and then using the apply method to create the column, but it doesn't work. The code is below:
Tried:
def f(row):
    if row['track_len'] == 8:
        val = row['track_no'].str[0:3]
    elif row['track_len'] == 5:
        val = row['track_no'].str[0:1]
    elif row['track_len'] == 7:
        val = row['track_no'].str[0:2]
    return val
df['new_col'] = df.apply(f, axis=1)
df.head()
Thus the desired output should be (based on the string slicing in f):
Output
{new_col: ['qwe', 'p', 'po']}
If there are alternative better solutions to this problem those would also be appreciated.
Your function works well; you just need to remove the .str part in your if blocks, since the row values are already plain strings:
def f(row):
    if row['track_len'] == 8:
        val = row['track_no'][:3]
    elif row['track_len'] == 5:
        val = row['track_no'][:1]
    elif row['track_len'] == 7:
        val = row['track_no'][:2]
    return val
df['new_col'] = df.apply(f, axis=1)
df.head()
# Output:
   item_id  track_no  track_len new_col
0        1  qwerty23          8     qwe
1        2     poiu2          5       p
2        3   poiuyt5          7      po
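If you would rather avoid apply altogether, one alternative (a sketch assuming track lengths are limited to 5, 7, and 8 as in the question) is to map each track_len to a slice width and slice once per row:

import pandas as pd

df = pd.DataFrame({'item_id': [1, 2, 3],
                   'track_no': ['qwerty23', 'poiu2', 'poiuyt5']})
df['track_len'] = df['track_no'].str.len()

# map each known length to the number of leading characters to keep
width = df['track_len'].map({8: 3, 5: 1, 7: 2})
df['new_col'] = [t[:w] for t, w in zip(df['track_no'], width)]
print(df['new_col'].tolist())  # ['qwe', 'p', 'po']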

Pandas apply function does not assign values to the column

I am trying to implement this logic on a pandas dataframe:
IF base_total_price > 0
    IF base_total_discount = 0
        actual_price = base_total_price
    IF base_total_discount > 0
        actual_price = base_total_price + base_total_discount
IF base_total_price = 0
    IF base_total_discount > 0
        actual_price = base_total_discount
    IF base_total_discount = 0
        actual_price = 0
So I wrote these two apply calls:
#for all entries where base_total_price > 0
df_slice_1['actual_price'] = df_slice_1['base_total_discount'].apply(lambda x: df_slice_1['base_total_price'] if x == 0 else df_slice_1['base_total_price']+df_slice_1['base_total_discount'])
#for all entries where base_total_price = 0
df_slice_1['actual_price'] = df_slice_1['base_total_discount'].apply(lambda x: x if x == 0 else df_slice_1['base_total_discount'])
When I run the code I get this error:
ValueError: Wrong number of items passed 20, placement implies 1
I know that it is trying to put more values into one column, but I do not understand why this is happening or how to solve it. All I need is to add the new column `actual_price` to the dataframe, with its values calculated according to the logic above. Please suggest a better way of implementing the logic, or correct my approach.
Sample data would have been useful. Please try using np.select(conditions, choices):
import numpy as np

Conditions = [(df.base_total_price > 0) & (df.base_total_discount == 0),
              (df.base_total_price > 0) & (df.base_total_discount > 0),
              (df.base_total_price == 0) & (df.base_total_discount > 0),
              (df.base_total_price == 0) & (df.base_total_discount == 0)]
choices = [df.base_total_price,
           df.base_total_price.add(df.base_total_discount),
           df.base_total_discount,
           0]
# use bracket assignment: df.actual_price = ... would set an attribute, not a column
df['actual_price'] = np.select(Conditions, choices)
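For example, with some made-up sample data (hypothetical values, since none were posted):

import numpy as np
import pandas as pd

df = pd.DataFrame({'base_total_price': [100, 80, 0, 0],
                   'base_total_discount': [0, 20, 15, 0]})
Conditions = [(df.base_total_price > 0) & (df.base_total_discount == 0),
              (df.base_total_price > 0) & (df.base_total_discount > 0),
              (df.base_total_price == 0) & (df.base_total_discount > 0),
              (df.base_total_price == 0) & (df.base_total_discount == 0)]
choices = [df.base_total_price,
           df.base_total_price.add(df.base_total_discount),
           df.base_total_discount,
           0]
df['actual_price'] = np.select(Conditions, choices)
print(df['actual_price'].tolist())  # [100, 100, 15, 0]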
I solved this simply by using iterrows. Thanks to everyone who responded.

What is the correct way of selecting value from pandas dataframe using column name and row index?

What is the most efficient way of selecting a value from a pandas dataframe using a column name and a row index (by that I mean row number)?
I have a case where I have to iterate through rows:
I have a working solution:
i = 0
while i < len(dataset) - 1:
    if dataset.target[i] == 1:
        dataset.sum_lost[i] = dataset['to_be_repaid_principal'][i] + dataset['to_be_repaid_interest'][i]
        dataset.ratio_lost[i] = dataset.sum_lost[i] / dataset['expected_returned_sum'][i]
    else:
        dataset.sum_lost[i] = 0
        dataset.ratio_lost[i] = 0
    i += 1
But this solution is very RAM-hungry. I am also getting the following warning:
"A value is trying to be set on a copy of a slice from a DataFrame."
So I am trying to come up with another one:
i = 0
while i < len(dataset) - 1:
    if dataset.iloc[i, :].loc['target'] == 1:
        dataset.iloc[i, :].loc['sum_lost'] = dataset.iloc[i, :].loc['to_be_repaid_principal'] + dataset.iloc[i, :].loc['to_be_repaid_interest']
        dataset.iloc[i, :].loc['ratio_lost'] = dataset.iloc[i, :].loc['sum_lost'] / dataset.iloc[i, :].loc['expected_returned_sum']
    else:
        dataset.iloc[i, :].loc['sum_lost'] = 0
        dataset.iloc[i, :].loc['ratio_lost'] = 0
    i += 1
But it does not work.
I would like to come up with a faster, less RAM-hungry solution, because this will actually be a web app that a few users could use simultaneously.
Thanks a lot.
If you are thinking about "looping through rows", you are not using pandas right. You should think in terms of columns instead.
Use np.where which is vectorized (read: fast):
import numpy as np

cond = dataset['target'] == 1
dataset['sum_lost'] = np.where(cond, dataset['to_be_repaid_principal'] + dataset['to_be_repaid_interest'], 0)
dataset['ratio_lost'] = np.where(cond, dataset['sum_lost'] / dataset['expected_returned_sum'], 0)
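A toy example (made-up numbers, just to show the result's shape):

import numpy as np
import pandas as pd

dataset = pd.DataFrame({'target': [1, 0, 1],
                        'to_be_repaid_principal': [100.0, 50.0, 30.0],
                        'to_be_repaid_interest': [10.0, 5.0, 3.0],
                        'expected_returned_sum': [220.0, 110.0, 66.0]})
cond = dataset['target'] == 1
dataset['sum_lost'] = np.where(cond, dataset['to_be_repaid_principal'] + dataset['to_be_repaid_interest'], 0)
dataset['ratio_lost'] = np.where(cond, dataset['sum_lost'] / dataset['expected_returned_sum'], 0)
print(dataset[['sum_lost', 'ratio_lost']].values.tolist())
# [[110.0, 0.5], [0.0, 0.0], [33.0, 0.5]]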

Create Excel-like SUMIFS in Pandas

I recently learned about pandas and was happy to see its analytics functionality. I am trying to convert Excel array functions into their pandas equivalents to automate spreadsheets I have created for performance attribution reports. In this example, I created a new column in Excel based on conditions within other columns:
={SUMIFS($F$10:$F$4518,$A$10:$A$4518,$C$4,$B$10:$B$4518,0,$C$10:$C$4518," ",$D$10:$D$4518,$D10,$E$10:$E$4518,$E10)}
The formula sums the values in the "F" array (security weights) based on certain conditions: the "A" array (portfolio ID) equals a certain number, the "B" array (security id) is zero, the "C" array (group description) is " ", the "D" array (start date) equals the start date of the row I am on, and the "E" array (end date) equals the end date of the row I am on.
In pandas, I am using a DataFrame. Creating a new column with the first three conditions is straightforward, but I am having difficulty with the last two:
reportAggregateDF['PORT_WEIGHT'] = reportAggregateDF['SEC_WEIGHT_RATE'][
    (reportAggregateDF['PORT_ID'] == portID) &
    (reportAggregateDF['SEC_ID'] == 0) &
    (reportAggregateDF['GROUP_LIST'] == " ") &
    (reportAggregateDF['START_DATE'] == reportAggregateDF['START_DATE'].ix[:]) &
    (reportAggregateDF['END_DATE'] == reportAggregateDF['END_DATE'].ix[:])].sum()
Obviously the .ix[:] in the last two conditions is not doing anything for me, but is there a way to make the sum conditional on the row I am on without looping? My goal is to avoid loops entirely and use purely vectorized operations.
You want to use the apply function and a lambda:
>>> df
       A  B    C    D     E
0  mitfx  0  200  300  0.25
1     gs  1  150  320  0.35
2    duk  1    5    2  0.45
3    bmo  1  145   65  0.65
Let's say I want to sum column C times E but only if column B == 1 and D is greater than 5:
df['matches'] = df.apply(lambda x: x['C'] * x['E'] if x['B'] == 1 and x['D'] > 5 else 0, axis=1)
df.matches.sum()
It might be cleaner to split this into two steps:
df_subset = df[(df.B == 1) & (df.D > 5)]
df_subset.apply(lambda x: x.C * x.E, axis=1).sum()
or simply use multiplication for speed:
df_subset = df[(df.B == 1) & (df.D > 5)]
print((df_subset.C * df_subset.E).sum())
You are absolutely right to want to do this problem without loops.
I'm sure there is a better way, but this did it in a loop:
for idx in reportAggregateDF.index:
    reportAggregateDF.loc[idx, 'PORT_WEIGHT'] = reportAggregateDF['SEC_WEIGHT_RATE'][
        (reportAggregateDF['PORT_ID'] == portID) &
        (reportAggregateDF['SEC_ID'] == 0) &
        (reportAggregateDF['GROUP_LIST'] == " ") &
        (reportAggregateDF['START_DATE'] == reportAggregateDF['START_DATE'].loc[idx]) &
        (reportAggregateDF['END_DATE'] == reportAggregateDF['END_DATE'].loc[idx])].sum()
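A loop-free alternative (my own sketch, using the column names from the question) is to filter once, compute the per-(START_DATE, END_DATE) sums with groupby, and align them back onto every row, which is exactly what the Excel SUMIFS does row by row:

import pandas as pd

# rows matching the fixed conditions
mask = ((reportAggregateDF['PORT_ID'] == portID) &
        (reportAggregateDF['SEC_ID'] == 0) &
        (reportAggregateDF['GROUP_LIST'] == " "))

# sum SEC_WEIGHT_RATE within each (START_DATE, END_DATE) group of matching rows
sums = (reportAggregateDF[mask]
        .groupby(['START_DATE', 'END_DATE'])['SEC_WEIGHT_RATE']
        .sum())

# look up each row's own (START_DATE, END_DATE) pair in the summed table
keys = pd.MultiIndex.from_frame(reportAggregateDF[['START_DATE', 'END_DATE']])
reportAggregateDF['PORT_WEIGHT'] = sums.reindex(keys).fillna(0).to_numpy()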
