Sample with different sample sizes per customer - python

I have a data frame like this:
  Customer  Day
0        A    1
1        A    1
2        A    1
3        A    2
4        B    3
5        B    4
and I want to sample from it, but with a different sample size for each customer. The sample size for each customer is stored in another dataframe. For example,
  Customer  Day
0        A    2
1        B    1
Suppose I want to sample per customer per day. So far I have this function:
def sampling(frame, a):
    return np.random.choice(frame.Id, size=a)

grouped = frame.groupby(['Customer', 'Day'])
sampled = grouped.apply(sampling, a=??).reset_index()
If I set the size parameter to a global constant, it runs without a problem. But I don't know how to set it when the different sizes live in a separate dataframe.

You can create a mapper from df1 (the frame that holds the sample sizes) and use that value as the sample size:
mapper = df1.set_index('Customer')['Day'].to_dict()
df.groupby('Customer', as_index=False).apply(lambda x: x.sample(n = mapper[x.name]))
     Customer  Day
0 3         A    2
  2         A    1
1 4         B    3
This returns a MultiIndex; you can always reset_index:
df.groupby('Customer').apply(lambda x: x.sample(n = mapper[x.name])).reset_index(drop = True)
  Customer  Day
0        A    1
1        A    1
2        B    3
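For reference, a minimal end-to-end sketch of the mapper approach with the sample frames above; the names df and sizes are illustrative (sizes plays the role of df1 in the answer).

import pandas as pd

df = pd.DataFrame({'Customer': ['A', 'A', 'A', 'A', 'B', 'B'],
                   'Day': [1, 1, 1, 2, 3, 4]})
sizes = pd.DataFrame({'Customer': ['A', 'B'], 'Day': [2, 1]})  # per-customer sample size

mapper = sizes.set_index('Customer')['Day'].to_dict()          # {'A': 2, 'B': 1}
sampled = (df.groupby('Customer')
             .apply(lambda g: g.sample(n=mapper[g.name]))
             .reset_index(drop=True))
print(sampled)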

Related

Python calculate increment rows till a condition

How do I obtain the result below?
(Sample data with the expected output was shared as an image.)
Time to default is the column to be calculated: an incrementing counter per account that runs until the Default column becomes 1.
The example shown is for a single account; the same has to be applied across multiple account ids.
Use:
m = df['Default'].eq(1).groupby(df['acct_id']).transform(lambda x: x.shift(fill_value=False).cummax())
df.loc[~m, 'new'] = df[~m].groupby('acct_id').cumcount()
print (df)
  acct_id  Default  new
0  xxx123        0  0.0
1  xxx123        0  1.0
2  xxx123        1  2.0
3  xxx123        1  NaN
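A self-contained sketch of the same idea, with the sample frame built inline and the counter written to a column named Time to default (column names assumed from the question):

import pandas as pd

df = pd.DataFrame({'acct_id': ['xxx123'] * 4,
                   'Default': [0, 0, 1, 1]})

# mask rows that come strictly after the first Default == 1 within each account
m = (df['Default'].eq(1)
       .groupby(df['acct_id'])
       .transform(lambda x: x.shift(fill_value=False).cummax()))

# incrementing counter per account on the unmasked rows; masked rows stay NaN
df.loc[~m, 'Time to default'] = df[~m].groupby('acct_id').cumcount()
print(df)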

Count Number of Times the Sale of a Toy is greater than Average for that Column

I have a dataset in which I have to check whether the sale value of a toy is greater than the average of its column, and count in how many different sale areas the value exceeds that average.
For example: find the average of column "SaleB" (2.5), check in how many rows the value is greater than 2.5, do the same for "SaleA" and "SaleC", and then add it all up.
input_data = pd.DataFrame({'Toy': ['A', 'B', 'C', 'D'],
                           'Color': ['Red', 'Green', 'Orange', 'Blue'],
                           'SaleA': [1, 2, 0, 1],
                           'SaleB': [1, 3, 4, 2],
                           'SaleC': [5, 2, 3, 5]})
A new column "Count_Sale_Average" is created; for example, for toy "A" the sale was greater than the average in only one position.
output_data = pd.DataFrame({'Toy': ['A', 'B', 'C', 'D'],
                            'Color': ['Red', 'Green', 'Orange', 'Blue'],
                            'SaleA': [1, 2, 0, 1],
                            'SaleB': [1, 3, 4, 2],
                            'SaleC': [5, 2, 3, 5],
                            'Count_Sale_Average': [1, 2, 1, 1]})
My code works and gives the desired output. Any suggestions for other ways of doing it, perhaps more efficient and in fewer lines?
list_var = ['SaleA', 'SaleB', 'SaleC']
df = input_data[list_var]
for i in range(0, len(list_var)):
    var = list_var[i]
    mean_var = df[var].mean()
    df[var] = df[var].apply(lambda x: 1 if x > mean_var else 0)

df['Count_Sale_Average'] = df[list_var].sum(axis=1)
output_data = pd.concat([input_data, df[['Count_Sale_Average']]], axis=1)
output_data
You could filter the Sale columns, compare against their means, and sum along the rows:
filtered = input_data.filter(like='Sale')
input_data['Count_Sale_Average'] = filtered.gt(filtered.mean()).sum(axis=1)
Output:
  Toy   Color  SaleA  SaleB  SaleC  Count_Sale_Average
0   A     Red      1      1      5                   1
1   B   Green      2      3      2                   2
2   C  Orange      0      4      3                   1
3   D    Blue      1      2      5                   1
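To see why this works: filtered.mean() returns one mean per Sale column, and DataFrame.gt with a Series compares each column against its own mean, so the row-wise sum counts how many sale areas beat their column average. A quick check of the intermediates on the same input_data:

filtered = input_data.filter(like='Sale')
print(filtered.mean())                           # SaleA 1.0, SaleB 2.5, SaleC 3.75
print(filtered.gt(filtered.mean()))              # boolean frame, True where a sale beats its column mean
print(filtered.gt(filtered.mean()).sum(axis=1))  # row-wise count of True values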

How can I create a column target based on two different columns?

I have the following DataFrame with the columns low_scarcity and high_scarcity (each row has a value in either high or low scarcity):
id  low_scarcity       high_scarcity
0   When I was five..
1                      I worked a lot...
2                      I went to parties...
3   1 week ago
4   2 months ago
5                      another story..
I want to create another column 'target' that is 0 when the entry is in the low_scarcity column and 1 when the entry is in the high_scarcity column. Just like this:
id  low_scarcity       high_scarcity         target
0   When I was five..                             0
1                      I worked a lot...          1
2                      I went to parties...       1
3   1 week ago                                    0
4   2 months ago                                  0
5                      another story..            1
I first tried replacing the entries with no value with 0 and then creating a boolean condition; however, I can't use .replace('', 0) because the empty cells don't actually appear as empty values.
Supposing your dataframe is called df and that each value is in either high or low scarcity, the following line of code does it:
import numpy as np
df['target'] = 1*np.array(df['high_scarcity']!="")
where the 1* converts the boolean values to integers.
If that is not the case, a more complex approach can be taken:
# placeholder array; object dtype so the assigned 0/1 stay integers
res = np.array(["" for i in range(df.shape[0])], dtype=object)
res[df['high_scarcity'] != ""] = 1
res[df['low_scarcity'] != ""] = 0
df['target'] = res
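A minimal reproducible sketch of the first approach, assuming the empty cells really are empty strings (if they are NaN instead, replace the comparison with .notna()):

import pandas as pd

df = pd.DataFrame({'low_scarcity': ['When I was five..', '', '', '1 week ago', '2 months ago', ''],
                   'high_scarcity': ['', 'I worked a lot...', 'I went to parties...', '', '', 'another story..']})

df['target'] = (df['high_scarcity'] != "").astype(int)
print(df)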

Selecting from pandas groups without replacement when possible

Say that I have a DataFrame that looks like:
Name Group_Id
A 1
B 2
C 2
I want a piece of code that selects n sets such that, as far as possible, each set contains different members of the same group.
A representative from each group must appear in each set (the representatives should be picked at random).
Only if a group's size is smaller than n may the same representative appear in multiple sets.
n is smaller than or equal to the size of the biggest group.
So for example, for the above DataFrame and n=2 this would be a valid result:
set 1
Name Group_Id
A 1
B 2
set 2
Name Group_Id
A 1
C 2
However, this one is not:
set 1
Name Group_Id
A 1
B 2
set 2
Name Group_Id
A 1
B 2
One way could be to upsample with replacement every group that is smaller than the largest group, so that each resulting dataframe gets one sample from each group. Then interleave the groups' rows and build a list of dataframes:
# size of largest group
max_size = df.groupby('Group_Id').size().max()

# upsample group if necessary
l = [g.sample(max_size, replace=True) if g.shape[0] < max_size else g
     for _, g in df.groupby('Group_Id')]

# interleave rows and build list of dataframes
[pd.DataFrame(g, columns=df.columns) for g in zip(*(i.to_numpy().tolist() for i in l))]
[ Name Group_Id
0 A 1
1 B 2,
Name Group_Id
0 A 1
1 C 2]
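A small sanity check that could be added after the list comprehension above (here the list of frames is bound to a hypothetical name sets_list): each resulting frame should contain exactly one row per Group_Id.

sets_list = [pd.DataFrame(g, columns=df.columns)
             for g in zip(*(i.to_numpy().tolist() for i in l))]
assert all(s['Group_Id'].is_unique for s in sets_list)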
Here's an idea:
# 1. label a random order within each Group_Id
df['sets'] = df.sample(frac=1).groupby('Group_Id').cumcount()

# 2. pivot the table and forward-fill
sets = (df.pivot(index='sets', columns='Group_Id').ffill()  # groups with fewer than N elements always repeat their last element
          .stack('Group_Id').reset_index('Group_Id')        # return Group_Id as a normal column
       )
# slices:
N = 2
for i in range(N):
    print(sets.loc[i])
Output:
      Group_Id Name
sets
0            1    A
0            2    C
      Group_Id Name
sets
1            1    A
1            2    B
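For reference, a self-contained sketch of this second idea with the example frame from the question (N = 2); this is only a sketch, and newer pandas versions may warn about stack's changing defaults.

import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'C'], 'Group_Id': [1, 2, 2]})

# random order inside each group; smaller groups get forward-filled below
df['sets'] = df.sample(frac=1).groupby('Group_Id').cumcount()
sets = (df.pivot(index='sets', columns='Group_Id').ffill()
          .stack('Group_Id').reset_index('Group_Id'))

N = 2
for i in range(N):
    print(sets.loc[i])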

Pandas: Calculated field can't be greater than value from another field

I'm trying to create a calculated field (spend) where the value of this field cannot be greater than another field (budget). The spend field is calculated based on two other fields (CPM, Impressions) with the formula ((Impressions/1000)*CPM).
I've created the spend field using the following:
df['spend'] = df['CPM']*(df['Impressions']/1000)
From there, I'm not able to find a way to apply an if/else condition to the rows of the spend field: if spend > budget, the row value should be replaced with the corresponding budget value; otherwise, keep the calculated spend value.
Thank you.
Use Series.mask or min with subset of columns:
df['spend'] = df['spend'].mask(df['spend'] > df['budget'], df['budget'])
df['spend'] = df[['spend', 'budget']].min(axis=1)
Sample:
df = pd.DataFrame({'spend': [1, 2, 8],
                   'budget': [4, 5, 6]})
print (df)
   budget  spend
0       4      1
1       5      2
2       6      8
df['spend'] = df['spend'].mask(df['spend'] > df['budget'], df['budget'])
print (df)
   budget  spend
0       4      1
1       5      2
2       6      6
df['spend'] = df[['spend', 'budget']].min(axis=1)
print (df)
   budget  spend
0       4      1
1       5      2
2       6      6
Another NumPy solution:
df['spend'] = np.where(df['spend'] > df['budget'], df['budget'], df['spend'])
Just get the minimum value:
df['spend'] = np.minimum(df['spend'], df['budget'])
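For completeness, Series.clip with a per-row upper bound should give the same result, since clip accepts an aligned Series as the bound:
df['spend'] = df['spend'].clip(upper=df['budget'])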
