Pandas: Calculated field can't be greater than value from another field - python

I'm trying to create a calculated field (spend) where the value of this field cannot be greater than another field (budget). The spend field is calculated based on two other fields (CPM, Impressions) with the formula ((Impressions/1000)*CPM).
I've created the spend field using the following:
df['spend'] = df['CPM']*(df['Impressions']/1000)
From there, I'm not able to find a way to apply an if/else condition to the rows of the spend field: if spend > budget, the row value should be replaced with the corresponding value from budget; otherwise, the calculated value in spend should be retained.
Thank you.

Use Series.mask or min with subset of columns:
df['spend'] = df['spend'].mask(df['spend'] > df['budget'], df['budget'])
df['spend'] = df[['spend', 'budget']].min(axis=1)
Sample:
df = pd.DataFrame({'spend':[1,2,8],
'budget':[4,5,6]})
print (df)
budget spend
0 4 1
1 5 2
2 6 8
df['spend'] = df['spend'].mask(df['spend'] > df['budget'], df['budget'])
print (df)
budget spend
0 4 1
1 5 2
2 6 6
df['spend'] = df[['spend', 'budget']].min(axis=1)
print (df)
budget spend
0 4 1
1 5 2
2 6 6

Another NumPy solution:
import numpy as np
df['spend'] = np.where(df['spend'] > df['budget'], df['budget'], df['spend'])

Just get the minimum value:
df['spend'] = np.minimum(df['spend'], df['budget'])
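If helpful, the same cap can also be written with Series.clip, passing the budget column as a per-row upper bound (a small sketch, equivalent to the mask/min/np.where approaches above):
df['spend'] = df['spend'].clip(upper=df['budget'])  # spend in the sample becomes 1, 2, 6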

Related

How can I create a column target based on two different columns?

I have the following DataFrame with the columns low_scarcity and high_scarcity (a value is in either the high or the low scarcity column):
id  low_scarcity       high_scarcity
0   When I was five..
1                      I worked a lot...
2                      I went to parties...
3   1 week ago
4   2 months ago
5                      another story..
I want to create another column 'target' that when there's an entry in low_scarcity column, the value will be 0, and when there's an entry in high_scarcity column, the value will be 1. Just like this:
id  low_scarcity       high_scarcity          target
0   When I was five..                         0
1                      I worked a lot...      1
2                      I went to parties...   1
3   1 week ago                                0
4   2 months ago                              0
5                      another story..        1
I tried first replacing the entries with no value with 0 and then creating a boolean condition; however, I can't use .replace('', 0) because the cells that are empty don't appear as empty values.
Supposing your dataframe is called df and that a value is in either the high or the low scarcity column, the following line of code does it:
import numpy as np
df['target'] = 1*np.array(df['high_scarcity']!="")
in which the 1* performs an integer conversion of the boolean values.
If that is not the case, then a more complex approach is needed:
# use an object array so the assigned 0/1 values keep their type and untouched rows stay ""
res = np.array(["" for i in range(df.shape[0])], dtype=object)
res[df['high_scarcity'] != ""] = 1
res[df['low_scarcity'] != ""] = 0
df['target'] = res
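If the blank cells are really missing values (NaN) rather than empty strings, which the question hints at ("the columns that are empty don't appear as empty values"), a sketch along the same lines, assuming NaN for the unfilled cells, would be:
import numpy as np
# 1 where high_scarcity is filled, 0 where low_scarcity is filled, NaN if neither
df['target'] = np.select([df['high_scarcity'].notna(), df['low_scarcity'].notna()], [1, 0], default=np.nan)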

python pandas new column with order of values

I would like to make a new column with the order of the numbers in a list. I get 3,1,0,4,2,5 (the indices of the lowest numbers), but I would like to have a new column with 2,1,4,0,3,5 (so if I look at a row I get the value and the order in which that value comes in the whole list). What am I doing wrong?
df = pd.DataFrame({'list': [4,3,6,1,5,9]})
df['order'] = df.sort_values(by='list').index
print(df)
What you're looking for is the rank:
import pandas as pd
df = pd.DataFrame({'list': [4,3,6,1,5,9]})
df['order'] = df['list'].rank().sub(1).astype(int)
Result:
list order
0 4 2
1 3 1
2 6 4
3 1 0
4 5 3
5 9 5
You can use the method parameter to control how to resolve ties.
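For what it's worth, the asker's original idea can also be completed by applying argsort twice: the first argsort gives the positions of the lowest numbers (3,1,0,4,2,5), and a second argsort inverts that into the rank of each row (a NumPy sketch, equivalent to rank(method='first') minus 1):
import numpy as np
df['order'] = np.argsort(np.argsort(df['list'].values))  # gives 2,1,4,0,3,5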

Is there a way to use a custom function in a pandas aggregation?

I want to apply a custom function to a DataFrame. Example DataFrame:
index City Age
0 1 A 50
1 2 A 24
2 3 B 65
3 4 A 40
4 5 B 68
5 6 B 48
Function to apply
def count_people_above_60(age):
    ** ***  # I don't know if age can or can't be passed as a Series or list to perform an operation on later
    return count_people_above_60
expecting to do something like
df.groupby(['City']).agg({"AGE": ["mean", "count_people_above_60"]})
Expected output:
City Mean People_Above_60
A 38 0
B 60.33 2
If performance is important, create a new column filled with the comparison values converted to integers, so the count can be computed with the sum aggregation:
df = (df.assign(new=df['Age'].gt(60).astype(int))
        .groupby(['City'])
        .agg(Mean=('Age', 'mean'), People_Above_60=('new', 'sum')))
print (df)
Mean People_Above_60
City
A 38.000000 0
B 60.333333 2
Your solution should be changed to compare the values and sum them, but it is slow if there are many groups or a large DataFrame:
def count_people_above_60(age):
return (age > 60).sum()
df = (df.groupby(['City']).agg(Mean=("Age" , "mean"),
People_Above_60=('Age',count_people_above_60)))
print (df)
Mean People_Above_60
City
A 38.000000 0
B 60.333333 2
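For completeness, the dict-style agg the question sketches also works once count_people_above_60 is defined as above, although it yields a MultiIndex column rather than the flat names (a sketch, assuming df is still the original unaggregated frame):
out = df.groupby('City').agg({'Age': ['mean', count_people_above_60]})
out.columns = ['Mean', 'People_Above_60']  # flatten the MultiIndex column names
print(out)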

How can I keep all columns in a dataframe, plus add groupby, and sum?

I have a data frame with 5 fields. I want to copy 2 fields from this into a new data frame, which works fine: df1 = df[['task_id', 'duration']]
Now in this df1, when I try to group by task_id and sum duration, the task_id field drops off.
Before (what I have now).
After (what I'm trying to achieve).
So, for instance, I'm trying this:
df1['total'] = df1.groupby(['task_id'])['duration'].sum()
The result is:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
I don't know why I can't just sum the values in a column and group by unique IDs in another column. Basically, all I want to do is preserve the original two columns (['task_id', 'duration']), sum duration, and calculate a percentage of duration in a new column named pct. This seems like a very simple thing but I can't get anything working. How can I get this straightened out?
The following code keeps both columns and adds a count per group:
df[['task_id', 'duration']].groupby(['task_id', 'duration']).size().reset_index(name='counts')
Setup:
import numpy as np
import pandas as pd

X = np.random.choice([0, 1, 2], 20)
Y = np.random.uniform(2, 10, 20)
df = pd.DataFrame({'task_id': X, 'duration': Y})
Calculate pct:
df = pd.merge(df, df.groupby('task_id').agg('sum').reset_index(), on='task_id')
df['pct'] = df['duration_x'].divide(df['duration_y']) * 100
df = df.drop('duration_y', axis=1)  # drops the summed duration; remove this line if you want to see it
Result:
duration_x task_id pct
0 8.751517 0 58.017921
1 6.332645 0 41.982079
2 8.828693 1 9.865355
3 2.611285 1 2.917901
4 5.806709 1 6.488531
5 8.045490 1 8.990189
6 6.285593 1 7.023645
7 7.932952 1 8.864436
8 7.440938 1 8.314650
9 7.272948 1 8.126935
10 9.162262 1 10.238092
11 7.834692 1 8.754639
12 7.989057 1 8.927129
13 3.795571 1 4.241246
14 6.485703 1 7.247252
15 5.858985 2 21.396850
16 9.024650 2 32.957771
17 3.885288 2 14.188966
18 5.794491 2 21.161322
19 2.819049 2 10.295091
disclaimer: All data is randomly generated in setup, however, calculations are straightforward and should be correct for any case.
I finally got everything working in the following way.
# group by and sum durations
df1 = df1.groupby('task_id', as_index=False).agg({'duration': 'sum'})
list(df1)
# find each task_id as relative percentage of whole
df1['pct'] = df1['duration']/(df1['duration'].sum())
df1 = pd.DataFrame(df1)
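As a side note, if the goal is to keep every original row while still computing per-task totals and percentages, groupby.transform does that without collapsing the frame (a sketch; the explicit .copy() also avoids the SettingWithCopyWarning quoted above):
df1 = df[['task_id', 'duration']].copy()
df1['total'] = df1.groupby('task_id')['duration'].transform('sum')  # per-row total of its task
df1['pct'] = df1['duration'] / df1['total'] * 100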

Getting a new series conditional on some rows being present in Python and Pandas

I did not know of an easier thing to call what I am trying to do. Edits welcome. Here is what I want to do.
I have store, date, and product indices and a column called price.
I have two unique products 1 and 2.
But for each store, I don't have an observation for every date, and for every date, I don't have both products necessarily.
I want to create a series for each store that is indexed by dates only when both products are present. The reason is that I want the value of the series to be product 1 price / product 2 price.
This is a highly unbalanced panel, and I did a horrible workaround of about 75 lines of code, so I appreciate any tips. This will be very useful in the future.
Data looks like below.
weeknum Location_Id Item_Id averageprice
70 201138 8501 1 0.129642
71 201138 8501 2 0.188274
72 201138 8502 1 0.129642
73 201139 8504 1 0.129642
Expected output in this simple case would be:
weeknum Location_Id averageprice
? 201138 8501 0.129642/0.188274
Since that is the only one with every requirement met.
I think this could be a join on the two sub-frames (but perhaps there is a cleaner pivoty way):
In [11]: res = pd.merge(df[df['Item_Id'] == 1], df[df['Item_Id'] == 2],
on=['weeknum', 'Location_Id'])
In [12]: res
Out[12]:
weeknum Location_Id Item_Id_x averageprice_x Item_Id_y averageprice_y
0 201138 8501 1 0.129642 2 0.188274
Now you can divide those two columns in the result:
In [13]: res['price'] = res['averageprice_x'] / res['averageprice_y']
In [14]: res
Out[14]:
weeknum Location_Id Item_Id_x averageprice_x Item_Id_y averageprice_y price
0 201138 8501 1 0.129642 2 0.188274 0.688582
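The "cleaner pivoty way" alluded to above could look roughly like this (a sketch; pivot_table leaves NaN where a product is missing, so dropna() keeps only the store/week pairs that have both products):
p = df.pivot_table(index=['weeknum', 'Location_Id'], columns='Item_Id', values='averageprice')
rel = (p[1] / p[2]).dropna()  # product 1 price divided by product 2 price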
Example data similar to yours:
weeknum loc_id item_id avg_price
0 1 8 1 8
1 1 8 2 9
2 1 9 1 10
3 2 10 1 11
First create a date mask that gets you the correct dates:
df_group = df.groupby(['loc_id', 'weeknum'])
df = df.join(df_group.item_id.apply(lambda x: len(x.unique()) == 2), on=['loc_id', 'weeknum'], rsuffix='_r')
weeknum loc_id item_id avg_price item_id_r
0 1 8 1 8 True
1 1 8 2 9 True
2 1 9 1 10 False
3 2 10 1 11 False
This gives you a boolean mask, per store and date group, marking where exactly two unique Item_Id values are present. From this you can now apply the function that concatenates your prices:
df[df.item_id_r].groupby(['loc_id','weeknum']).avg_price.apply(lambda x: '/'.join([str(y) for y in x]))
loc_id weeknum
8 1 8/9
It's a bit verbose and lots of lambdas but it will get you started and you can refactor to make faster and/or more concise if you want.
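A more concise refactor of the same idea, as the answer suggests, could use transform('nunique') for the mask (a sketch, assuming the df and column names from the example above):
mask = df.groupby(['loc_id', 'weeknum'])['item_id'].transform('nunique').eq(2)
out = (df[mask].groupby(['loc_id', 'weeknum'])['avg_price']
              .apply(lambda s: '/'.join(s.astype(str))))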
Let's say your full dataset is called TILPS. Then you might try this:
import pandas as pd
from __future__ import division
# Get list of unique dates present in TILPS
datelist = list(TILPS['datetime'].unique())
# Get list of unique stores present in TILPS
storelist = list(TILPS['store'].unique())
# For a given date, extract relative price
def dateLevel(daterow):
    # use float, not int, so fractional prices are not truncated
    price1 = float(daterow.loc[daterow['Item_id'] == 1, 'averageprice'].unique()[0])
    price2 = float(daterow.loc[daterow['Item_id'] == 2, 'averageprice'].unique()[0])
    return pd.DataFrame(pd.Series({'relprice': price1 / price2}))
# For each store, extract relative price for each date
def storeLevel(group, datelist):
    # keep the columns dateLevel needs, restricted to the dates of interest
    exist = group.loc[group['datetime'].isin(datelist), ['datetime', 'weeknum', 'locid', 'Item_id', 'averageprice']]
    exist_gr = exist.groupby('datetime')
    relprices = exist_gr.apply(dateLevel)
    # Merge relprices with exist on INDEX.
    exist = exist.merge(relprices, left_index=True, right_index=True)
    return exist
# Group TILPS by store
gr_store = TILPS.groupby('store')
fn = lambda x: storeLevel(x, datelist)
output = gr_store.apply(fn)
# Peek at output
print(output.head(30))
