Python random sample selection based on multiple conditions

I want to draw a random sample in Python from the following df such that at least 65% of the sampled rows have color yellow and the cumulative sum of the selected quantities is less than or equal to 18.
Original Dataset:
Date Id color qty
02-03-2018 A red 5
03-03-2018 B blue 2
03-03-2018 C green 3
04-03-2018 D yellow 4
04-03-2018 E yellow 7
04-03-2018 G yellow 6
04-03-2018 H orange 8
05-03-2018 I yellow 1
06-03-2018 J yellow 5
I have the total-quantity condition covered, but I am stuck on how to integrate the percentage condition:
df2 = df1.sample(n=df1.shape[0])
df3 = df2[df2.qty.cumsum() <= 18]
Required dataset:
Date Id color qty
03-03-2018 B blue 2
04-03-2018 D yellow 4
04-03-2018 G yellow 6
06-03-2018 J yellow 5
Or something like this:
Date Id color qty
02-03-2018 A red 5
04-03-2018 D yellow 4
04-03-2018 E yellow 7
05-03-2018 I yellow 1
Any help would be really appreciated!
Thanks in advance.

Filter rows with 'yellow' and select a random sample of at least 65% of your total sample size:
import random
# sample_size must be defined first: the desired number of rows in the final sample
yellow_size = float(random.randint(65, 100)) / 100
df_yellow = df3[df3['color'] == 'yellow'].sample(int(yellow_size * sample_size))
Filter rows with the other colors and select a random sample for the remainder of your sample size:
others_size = 1 - yellow_size
df_others = df3[df3['color'] != 'yellow'].sample(int(others_size * sample_size))
Combine them both and shuffle the rows:
df_sample = pd.concat([df_yellow, df_others]).sample(frac=1)
UPDATE:
If you want to check both conditions simultaneously, this could be one way to do it:
import random
df_sample = df
while sum(df_sample['qty']) > 18:
    yellow_size = float(random.randint(65, 100)) / 100
    df_yellow = df[df['color'] == 'yellow'].sample(int(yellow_size * sample_size))
    others_size = 1 - yellow_size
    df_others = df[df['color'] != 'yellow'].sample(int(others_size * sample_size))
    df_sample = pd.concat([df_yellow, df_others]).sample(frac=1)
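For reference, a self-contained sketch of this rejection loop on the question's data (the sample_size value and the int() truncation are assumptions, not part of the original answer):
import random
import pandas as pd

df = pd.DataFrame({'Id': list('ABCDEGHIJ'),
                   'color': ['red', 'blue', 'green', 'yellow', 'yellow',
                             'yellow', 'orange', 'yellow', 'yellow'],
                   'qty': [5, 2, 3, 4, 7, 6, 8, 1, 5]})

sample_size = 4  # assumed target sample size
df_sample = df
while sum(df_sample['qty']) > 18:
    # redraw until the quantity cap is met; the yellow share stays around 65%+ by construction
    yellow_size = float(random.randint(65, 100)) / 100
    df_yellow = df[df['color'] == 'yellow'].sample(int(yellow_size * sample_size))
    others_size = 1 - yellow_size
    df_others = df[df['color'] != 'yellow'].sample(int(others_size * sample_size))
    df_sample = pd.concat([df_yellow, df_others]).sample(frac=1)
print(df_sample)
Note that the loop has no iteration cap, so with a quantity cap that can rarely be met it may spin for a long time.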

I would use this package to over-sample your yellows into a new sample that has the balance you want:
https://imbalanced-learn.readthedocs.io/en/stable/over_sampling.html
From there, just randomly select items and check the sum until you have the set you want.
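A rough sketch of that suggestion (assumes imbalanced-learn is installed, that the question's frame is named df, and that a target of 8 yellow rows, about 67% of the pool, is an acceptable balance; those counts are illustrative assumptions):
import pandas as pd
from imblearn.over_sampling import RandomOverSampler

# over-sample only the 'yellow' class up to 8 rows; other classes keep their counts
ros = RandomOverSampler(sampling_strategy={'yellow': 8}, random_state=0)
X_res, y_res = ros.fit_resample(df[['qty']], df['color'])

pool = X_res.copy()
pool['color'] = list(y_res)

# then randomly select from the rebalanced pool while keeping the qty cap
sample = pool.sample(frac=1, random_state=0)
sample = sample[sample['qty'].cumsum() <= 18]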
Something less time-complex would be binary searching a range the length of your data frame, using the midpoint as your sample size, until you get the cumsum you want. This assumes the feature is symmetrically distributed.
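A minimal sketch of that binary-search idea (the helper name and the single up-front shuffle are assumptions): for non-negative quantities the cumulative sum of a fixed shuffle is non-decreasing, so the largest prefix under the cap can be binary searched.
def largest_prefix_under_cap(df, cap=18, seed=0):
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    lo, hi = 0, len(shuffled)
    while lo < hi:
        mid = (lo + hi + 1) // 2              # candidate sample size
        if shuffled['qty'].iloc[:mid].sum() <= cap:
            lo = mid                          # prefix fits, try a larger one
        else:
            hi = mid - 1                      # too big, shrink
    return shuffled.iloc[:lo]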

I think this example will help you. I add a df2['yellow_rate'] column and calculate the rate; you only need to check the rate stored in the last row of df2.
df1 = pd.DataFrame({'id': ['A','B','C','D','E','G','H','I','J'],
                    'color': ['red','blue','green','yellow','yellow','yellow','orange','yellow','yellow'],
                    'qty': [5, 2, 3, 4, 7, 6, 8, 1, 5]})
df2 = df1.sample(n=df1.shape[0])
df2['yellow_rate'] = df2[df2.qty.cumsum() <= 18]['color'].apply(lambda x: 1 if x == 'yellow' else 0)
# append a summary row holding the mean rate (DataFrame.append was removed in pandas 2.0)
summary = df2.sum(numeric_only=True) / df2.count(numeric_only=True)
df2 = pd.concat([df2.dropna(), summary.to_frame().T], ignore_index=True)
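To then enforce the percentage condition, one could redraw until the summary row's rate is high enough (a hedged sketch; only the 0.65 threshold comes from the question):
rate = df2.iloc[-1]['yellow_rate']   # the appended summary row
if rate >= 0.65:
    result = df2.iloc[:-1]           # keep everything except the summary row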

Related

How can I count instances of a string in a dataframe column of lists that matches the string of a column in a different dataframe?

I have a dataframe containing a column of produce and a column of a list of colors the produce comes in:
import pandas as pd
data = {'produce': ['zucchini','apple','citrus','banana','pear'],
        'colors': ['green, yellow','green, red, yellow','orange, yellow ,green','yellow','green, yellow, brown']}
df = pd.DataFrame(data)
print(df)
Dataframe looks like:
produce colors
0 zucchini green, yellow
1 apple green, red, yellow
2 citrus orange, yellow, green
3 banana yellow
4 pear green, yellow, brown
I am trying to create a second dataframe with each color, counting the number of rows in the first dataframe that include that color. I am able to get the unique list of colors into a dataframe:
#Create Dataframe with a column of unique values
unique_colors = df['colors'].str.split(",").explode().unique()
df2 = pd.DataFrame()
df2['Color'] = unique_colors
print(df2)
But some of the colors repeat some of the time:
Color
0 green
1 yellow
2 red
3 orange
4 green
5 yellow
6 brown
and I am unable to find a way to add a column that counts the instances in the other dataframe. I have tried:
#df['Count'] = data['colors'] == df2['Color']
df['Count'] = ()
for i in df2['Color']:
    count = 0
    if df["colors"].str.contains(i):
        count + 1
        df['Count'] = count
but I get the error "ValueError: Length of values (0) does not match length of index (5)"
How can I:
1. make sure values aren't repeated in the list, and
2. count the instances of the color in the other dataframe?
(This is a simplification of a much larger dataframe, so I can't just edit values in the first dataframe to fix the unique color issue).
You need to consider the spaces around , when splitting. To count the occurrences of each color, you can use Series.value_counts():
out = (df['colors'].str.split(' *, *')
       .explode().value_counts()
       .to_frame('Count')
       .rename_axis('Color')
       .reset_index())
print(out)
Color Count
0 yellow 5
1 green 4
2 red 1
3 brown 1
4 orange 1
Proposed script
import operator
y_c = (df['colors'].agg(lambda x: [e.strip() for e in x.split(',')])
       .explode())
clrs = pd.DataFrame.from_dict({c: [operator.countOf(y_c, c)] for c in y_c.unique()})
Two presentations for the result
1 - Horizontal :
print(clrs.rename(index={0:'count'}))
# green yellow red orange brown
# count 4 5 1 1 1
2 - Vertical :
print(clrs.T.rename(columns={0:'count'}))
# count
# green 4
# yellow 5
# red 1
# orange 1
# brown 1

Count Number of Times the Sale of a Toy is greater than Average for that Column

I have a dataset in which I have to identify whether the sale value of a toy is greater than the average of its column, and count in how many different sale areas the value is greater than the average.
For example: find the average of column "SaleB" (2.5), check in how many rows the value is greater than 2.5, then perform the same exercise for "SaleA" and "SaleC" and add it all up.
input_data = pd.DataFrame({'Toy': ['A','B','C','D'],
                           'Color': ['Red','Green','Orange','Blue'],
                           'SaleA': [1,2,0,1],
                           'SaleB': [1,3,4,2],
                           'SaleC': [5,2,3,5]})
A new column "Count_Sale_Average" is created; for example, for toy "A" the sale was greater than the average at only one position.
output_data = pd.DataFrame({'Toy': ['A','B','C','D'],
                            'Color': ['Red','Green','Orange','Blue'],
                            'SaleA': [1,2,0,1],
                            'SaleB': [1,3,4,2],
                            'SaleC': [5,2,3,5],
                            'Count_Sale_Average': [1,2,1,1]})
My code works and gives the desired output. Any suggestions on other ways of doing it, perhaps more efficient and in fewer lines?
list_var = ['SaleA','SaleB','SaleC']
df = input_data[list_var]
for i in range(0, len(list_var)):
    var = list_var[i]
    mean_var = df[var].mean()
    df[var] = df[var].apply(lambda x: 1 if x > mean_var else 0)
df['Count_Sale_Average'] = df[list_var].sum(axis=1)
output_data = pd.concat([input_data, df[['Count_Sale_Average']]], axis=1)
output_data
You could filter the Sale columns, compare them against their means, and sum along the rows:
filtered = input_data.filter(like='Sale')
input_data['Count_Sale_Average'] = filtered.gt(filtered.mean()).sum(axis=1)
Output:
Toy Color SaleA SaleB SaleC Count_Sale_Average
0 A Red 1 1 5 1
1 B Green 2 3 2 2
2 C Orange 0 4 3 1
3 D Blue 1 2 5 1

Iterate rows and find sum of rows not exceeding a number

Below is a dataframe showing coordinates (from and to), each row having a corresponding value column.
I want to find the ranges of coordinates where the value column doesn't exceed 5. Below is the dataframe input.
import pandas as pd
From=[10,20,30,40,50,60,70]
to=[20,30,40,50,60,70,80]
value=[2,3,5,6,1,3,1]
df=pd.DataFrame({'from':From, 'to':to, 'value':value})
print(df)
Hence I want to convert the table above into the following outcome:
from  to  value
10    30  5
30    40  5
40    50  6
50    80  5
Further explanation:
Coordinates from 10 to 30 are joined and the value column becomes 5, the sum of the values from 10 to 30 (not exceeding 5)
Coordinates 30 to 40 equal 5
Coordinates 40 to 50 equal 6 (more than 5; however, it is included as it cannot be divided further)
The remaining coordinates sum up to a value of 5
What code is required to achieve the above?
We can do a groupby on cumsum:
s = df['value'].ge(5)
(df.groupby([~s, s.cumsum()], as_index=False, sort=False)
   .agg({'from':'min','to':'max', 'value':'sum'}))
Output:
from to value
0 10 30 5
1 30 40 5
2 40 50 6
3 50 80 5
Update: It looks like you want to accumulate the values so that the new groups do not exceed 5. There are several threads on SO saying that this can only be done with a for loop. So we can do something like this:
thresh = 5
groups, partial, curr_grp = [], thresh, 0
for v in df['value']:
    if partial + v > thresh:
        curr_grp += 1
        partial = v
    else:
        partial += v
    groups.append(curr_grp)
df.groupby(groups).agg({'from':'min','to':'max', 'value':'sum'})

problem with re index dataframe (dealing with categorical data)

I have data that looks like this:
subject_id hour_measure urine color heart_rate
3 1 red 40
3 1.15 red 60
4 2 yellow 50
I want to reindex the data to get 24 hours of measurements for every patient.
I use the following code:
mux = pd.MultiIndex.from_product([df['subject_id'].unique(), np.arange(1, 24)],
                                 names=['subject_id', 'hour_measure'])
df = df.groupby(['subject_id','hour_measure']).mean().reindex(mux).reset_index()
df.to_csv('totalafterreindex.csv')
It works well with numeric values, but it drops the categorical ones.
How can I enhance this code to use the mean for numeric columns and the most frequent value for categorical ones?
the wanted output
subject_id hour_measure urine color heart_rate
3 1 red 40
3 2 red 60
3 3 yellow 50
3 4 yellow 50
.. .. ..
The idea is to use GroupBy.agg with mean for numeric columns and mode for categorical ones; next with iter is added to return None if mode returns an empty Series:
mux = pd.MultiIndex.from_product([df['subject_id'].unique(), np.arange(1, 24)],
                                 names=['subject_id', 'hour_measure'])
f = lambda x: x.mean() if np.issubdtype(x.dtype, np.number) else next(iter(x.mode()), None)
df1 = df.groupby(['subject_id','hour_measure']).agg(f).reindex(mux).reset_index()
Detail:
print (df.groupby(['subject_id','hour_measure']).agg(f))
urine color heart_rate
subject_id hour_measure
3 1.00 red 40
1.15 red 60
4 2.00 yellow 50
Last, if necessary, forward-fill the missing values per subject_id using GroupBy.ffill:
cols = df.columns.difference(['subject_id','hour_measure'])
df[cols] = df.groupby('subject_id')[cols].ffill()

Creating a quality score column in pandas

Hello, I am working with a dataframe in pandas which looks something like this:
ID Color Size Shape
1 Blue Small Triangle
2 Red Medium Square
3 Yellow Large Circle
I would like to compare each row to a list of data and create a new score column that counts the number of times each row matches the list.
Example: [Red, Medium, Circle] would yield the following dataframe.
ID Color Size Shape Score
1 Blue Small Triangle 0
2 Red Medium Square 2
3 Yellow Large Circle 1
Ideally I would like the ability to create multiple score columns to check against multiple lists.
Using isin on the DataFrame:
l=['Red', 'Medium', 'Circle']
df['score']=df.isin(l).sum(1)
df
Out[404]:
ID Color Size Shape score
0 1 Blue Small Triangle 0
1 2 Red Medium Square 2
2 3 Yellow Large Circle 1
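For the multiple-score-columns wish, a minimal sketch (the score names and the second list are assumptions):
checks = {'score_1': ['Red', 'Medium', 'Circle'],
          'score_2': ['Blue', 'Small', 'Square']}
for name, lst in checks.items():
    # restrict isin to the attribute columns so the ID column can't produce spurious matches
    df[name] = df[['Color', 'Size', 'Shape']].isin(lst).sum(axis=1)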
