Creating a quality score column in pandas - python

Hello, I am working with a dataframe in pandas which looks something like this:
ID  Color   Size    Shape
1   Blue    Small   Triangle
2   Red     Medium  Square
3   Yellow  Large   Circle
I would like to compare each row to a list of data and create a new score column that counts the number of times each row matches the list.
For example, [Red, Medium, Circle] would yield the following dataframe:
ID  Color   Size    Shape     Score
1   Blue    Small   Triangle  0
2   Red     Medium  Square    2
3   Yellow  Large   Circle    1
Ideally I would like the ability to create multiple score columns to check against multiple lists.

Use DataFrame.isin, then sum the matches across each row:
l = ['Red', 'Medium', 'Circle']
df['score'] = df.isin(l).sum(axis=1)
df
Out[404]:
   ID  Color   Size    Shape     score
0  1   Blue    Small   Triangle  0
1  2   Red     Medium  Square    2
2  3   Yellow  Large   Circle    1
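For the follow-up about multiple score columns, the same idea extends to a dict of lists; a minimal sketch (the column names and the second list are made up for illustration):
checks = {'score_a': ['Red', 'Medium', 'Circle'],
          'score_b': ['Blue', 'Large', 'Square']}  # hypothetical lists

for col, values in checks.items():
    # one new score column per list, counting per-row matches
    df[col] = df.isin(values).sum(axis=1)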

Related

How can I count instances of a string in a dataframe column of lists that matches the string of a column in a different dataframe?

I have a dataframe containing a column of produce and a column of a list of colors the produce comes in:
import pandas as pd
data = {'produce': ['zucchini', 'apple', 'citrus', 'banana', 'pear'],
        'colors': ['green, yellow', 'green, red, yellow', 'orange, yellow ,green',
                   'yellow', 'green, yellow, brown']}
df = pd.DataFrame(data)
print(df)
Dataframe looks like:
    produce                 colors
0  zucchini          green, yellow
1     apple     green, red, yellow
2    citrus  orange, yellow ,green
3    banana                 yellow
4      pear   green, yellow, brown
I am trying to create a second dataframe with each color, and count the number of rows in the first dataframe that include that color. I am able to get the unique list of colors into a dataframe:
#Create Dataframe with a column of unique values
unique_colors = df['colors'].str.split(",").explode().unique()
df2 = pd.DataFrame()
df2['Color'] = unique_colors
print(df2)
But some of the colors repeat some of the time:
Color
0 green
1 yellow
2 red
3 orange
4 green
5 yellow
6 brown
and I am unable to find a way to add a column that counts the instances in the other dataframe. I have tried:
#df['Count'] = data['colors'] == df2['Color']
df['Count'] = ()
for i in df2['Color']:
    count = 0
    if df["colors"].str.contains(i):
        count + 1
        df['Count'] = count
but I get the error "ValueError: Length of values (0) does not match length of index (5)"
How can I:
1. make sure values aren't repeated in the list, and
2. count the instances of the color in the other dataframe?
(This is a simplification of a much larger dataframe, so I can't just edit values in the first dataframe to fix the unique color issue).
You need to consider the spaces around the comma when splitting (the ValueError itself comes from df['Count'] = (), which assigns a length-0 value to a 5-row frame). To count the occurrences of each color, you can use Series.value_counts():
out = (df['colors'].str.split(' *, *')  # regex split absorbs stray spaces
         .explode().value_counts()
         .to_frame('Count')
         .rename_axis('Color')
         .reset_index())
print(out)
Color Count
0 yellow 5
1 green 4
2 red 1
3 brown 1
4 orange 1
Proposed script
import operator

y_c = (df['colors'].agg(lambda x: [e.strip() for e in x.split(',')])
                   .explode())
clrs = pd.DataFrame.from_dict({c: [operator.countOf(y_c, c)] for c in y_c.unique()})
Two presentations for the result
1 - Horizontal:
print(clrs.rename(index={0:'count'}))
# green yellow red orange brown
# count 4 5 1 1 1
2 - Vertical:
print(clrs.T.rename(columns={0:'count'}))
# count
# green 4
# yellow 5
# red 1
# orange 1
# brown 1

Sort through pandas dataframe with columns of different lengths to find matches

I have a dataframe with 4 columns, but different numbers of rows and several blank (e.g., isnull / NaN) cells.
For example, here's a baby dataframe with the situation:
data = {'mrn1': [1, 1, 1, 2, 2, 3, 3, 3, 4],
        'race1': ['', '', '', 'white', 'white', 'black', 'black', 'black', ''],
        'mrn2': [1, 1, 3, 3, 4, 6],
        'race2': ['black', 'black', 'black', 'black', 'white', 'white']}
dfx = pd.DataFrame.from_dict(data, orient='index')
dfx = dfx.transpose()
which produces:
mrn1  race1  mrn2  race2
1            1     black
1            1     black
1            3     black
2     white  3     black
2     white  4     white
3     black  6     white
3     black
3     black
4
What I am trying to accomplish is to fill in the missing race1 values (I am not concerned about race2) by matching mrn1 against mrn2 and, where there is a match, assigning the corresponding race2 value to the blank race1 cell. Alternatively (perhaps less messy), create and append a new race3 column that looks up mrn1 and takes race2 if race1 is null, and race1 otherwise.
desired final product:
mrn1  race1  mrn2  race2  race3
1            1     black  black
1            1     black  black
1            3     black  black
2     white  3     black  white
2     white  4     white  white
3     black  6     white  black
3     black               black
3     black               black
4                         white
I have attempted several iterations of join, merge (all types), and df.iterrows to no avail. In all cases, I can only coordinate the matching and assignment if/when the mrn1 and mrn2 rows match. There are no variations of code that come close, so there is no use providing any examples of code I have tried. Any help will be most welcome. Thanks!
Try the following:
mask = (dfx.mrn1 == dfx.mrn2) & (dfx.race1 == '')
dfx.loc[mask, 'race1'] = dfx.loc[mask, 'race2']
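Note that this only fills in race1 where the matching mrn happens to sit on the same row, which is exactly the limitation the asker describes. A sketch of a lookup-based alternative (the race_map name is mine):
import numpy as np

# mrn -> race lookup built from the mrn2/race2 columns (first occurrence wins)
race_map = (dfx.dropna(subset=['mrn2'])
               .drop_duplicates('mrn2')
               .set_index('mrn2')['race2'])

# race3: keep race1 where present, otherwise look up race2 by mrn1
dfx['race3'] = dfx['race1'].replace('', np.nan).fillna(dfx['mrn1'].map(race_map))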
One approach would be to create a master DataFrame of the race values, using concat to stack the races together, then dropna to remove missing rows, then drop_duplicates (keeping values from race1 over those from race2), and finally merge back to the initial DataFrame:
key_df = (
    pd.concat([
        dfx[['mrn1', 'race1']]
            .mask(dfx['race1'].eq(''))             # convert empty strings to NaN
            .set_axis(['mrn1', 'race3'], axis=1),  # rename to match output
        dfx[['mrn2', 'race2']]
            .set_axis(['mrn1', 'race3'], axis=1)   # rename to match output
    ]).dropna()                                    # drop NaN rows
    # drop duplicates (keeping the first value, i.e. the one from race1)
    .drop_duplicates(['mrn1', 'race3'], keep='first')
)
# Merge DataFrame to get new column
dfx = dfx.merge(key_df, on='mrn1')
dfx:
  mrn1  race1  mrn2  race2  race3
0    1            1  black  black
1    1            1  black  black
2    1            3  black  black
3    2  white     3  black  white
4    2  white     4  white  white
5    3  black     6  white  black
6    3  black  None   None  black
7    3  black  None   None  black
8    4         None   None  white
key_df for reference:
  mrn1  race3
0    2  white
1    3  black
2    1  black
4    4  white
5    6  white
A more compact version, without the comments and taking advantage of the various function defaults:
cols = ['mrn1', 'race3']
dfx = dfx.merge(
    pd.concat([
        dfx[['mrn1', 'race1']].mask(dfx['race1'].eq(''))
            .set_axis(cols, axis=1),
        dfx[['mrn2', 'race2']].set_axis(cols, axis=1)
    ]).dropna().drop_duplicates(cols)
)

How to add a new column and fill it up with a specific value depending on another column's series?

I'm new to Pandas, but thanks to "Add column with constant value to pandas dataframe" I was able to add several columns at once with
c = {'new1': 'w', 'new2': 'y', 'new3': 'z'}
df.assign(**c)
However, I'm trying to figure out what path to take when I want to add a new column to a dataframe (currently 1.2 million rows * 23 columns).
Let's simplify the df a bit and try to make it more clear:
Order  Orderline  Product
1      0          Laptop
1      1          Bag
1      2          Mouse
2      0          Keyboard
3      0          Laptop
3      1          Mouse
I would like to add a new column: if the order contains at least one product == Bag, it should be 1 (for all rows of that specific order), otherwise 0.
Result would become:
Order  Orderline  Product   HasBag
1      0          Laptop    1
1      1          Bag       1
1      2          Mouse     1
2      0          Keyboard  0
3      0          Laptop    0
3      1          Mouse     0
What I could do is find all the unique order numbers, filter the subframe for each, check the Product column for Bag, write 1 to a new column if found and 0 otherwise, and then replace the original subframe with the result.
There is most likely a better and more performant way to accomplish this.
The main reason I'm trying to do this, is to flatten things down later on. Every order should become 1 line with some values of product. I don't need the information for Bag anymore but I want to keep in my dataframe if the original order used to have a Bag (1) or no Bag (0).
Ultimately when the data is cleaned out it can be used as a base for scikit-learn (or that's what I hope).
If I understand you correctly, you want GroupBy.transform with 'any'.
First we create a boolean Series by checking which rows in Product are Bag with Series.eq. Then we group this boolean Series by Order and check whether any of the values in each group are True. We use transform to keep the shape of our initial Series so we can assign the values back.
df['ind'] = df['Product'].eq('Bag').groupby(df['Order']).transform('any').astype(int)
   Order  Orderline  Product   ind
0  1      0          Laptop    1
1  1      1          Bag       1
2  1      2          Mouse     1
3  2      0          Keyboard  0
4  3      0          Laptop    0
5  3      1          Mouse     0
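For the flattening step the question mentions (one line per order, keeping the bag flag), a rough sketch, assuming the 'ind' column from above has been added and that joining the remaining products into a single string is an acceptable final shape:
flat = (df.groupby('Order')
          .agg(HasBag=('ind', 'max'),
               Products=('Product', lambda s: ', '.join(s[s.ne('Bag')])))
          .reset_index())
print(flat)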

Convert data frame with a single column to a difference matrix

I have a data frame with a single column of values and an index of sample names:
>>> df = pd.DataFrame(data={'value':[1,3,4]},index=['cat','dog','bird'])
>>> print(df)
      value
cat       1
dog       3
bird      4
I would like to convert this to a square matrix wherein each cell of the matrix shows the difference between every set of two values:
      cat  dog  bird
cat     0    2     3
dog     2    0     1
bird    3    1     0
Is this possible? If so, how do I go about doing this?
I have tried to use scipy.spatial.distance.squareform to convert my starting data frame into a matrix, but apparently what I am starting with is not the right type of vector. Any help would be much appreciated!
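One way is to broadcast the column against itself with NumPy and take the absolute value; a minimal sketch:
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'value': [1, 3, 4]}, index=['cat', 'dog', 'bird'])

# pairwise differences via broadcasting: v[i] - v[j] for every i, j
v = df['value'].to_numpy()
out = pd.DataFrame(np.abs(v[:, None] - v[None, :]),
                   index=df.index, columns=df.index)
print(out)
As for squareform: it expects the condensed distance vector produced by scipy.spatial.distance.pdist, not the raw data, so pd.DataFrame(squareform(pdist(df)), index=df.index, columns=df.index) gives the same result here (the default Euclidean metric reduces to an absolute difference for a single column).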

Python random sample selection based on multiple conditions

I want to draw a random sample in Python from the following df such that at least 65% of the sampled rows have the color yellow, and the cumulative sum of the selected quantities is less than or equal to 18.
Original Dataset:
Date        Id  color   qty
02-03-2018  A   red     5
03-03-2018  B   blue    2
03-03-2018  C   green   3
04-03-2018  D   yellow  4
04-03-2018  E   yellow  7
04-03-2018  G   yellow  6
04-03-2018  H   orange  8
05-03-2018  I   yellow  1
06-03-2018  J   yellow  5
I have the total-quantity condition covered, but I am stuck on how to integrate the percentage condition:
df2 = df1.sample(n=df1.shape[0])
df3= df2[df2.qty.cumsum() <= 18]
Required dataset:
Date        Id  color   qty
03-03-2018  B   blue    2
04-03-2018  D   yellow  4
04-03-2018  G   yellow  6
06-03-2018  J   yellow  5
Or something like this:
Date        Id  color   qty
02-03-2018  A   red     5
04-03-2018  D   yellow  4
04-03-2018  E   yellow  7
05-03-2018  I   yellow  1
Any help would be really appreciated!
Thanks in advance.
Filter the rows with 'yellow' and select a random sample of at least 65% of your total sample size:
import random

yellow_size = float(random.randint(65, 100)) / 100
# sample() needs an integer count; sample_size (the desired total) is assumed defined elsewhere
df_yellow = df3[df3['color'] == 'yellow'].sample(int(yellow_size * sample_size))
Filter the rows with other colors and select a random sample for the remainder of your sample size:
others_size = 1 - yellow_size
df_others = df3[df3['color'] != 'yellow'].sample(int(others_size * sample_size))
Combine them both and shuffle the rows.
df_sample = pd.concat([df_yellow, df_others]).sample(frac=1)
UPDATE:
If you want to check for both conditions simultaneously, this could be one way to do it:
import random

df_sample = df
while sum(df_sample['qty']) > 18:
    yellow_size = float(random.randint(65, 100)) / 100
    # sample_size assumed defined as in the snippet above
    df_yellow = df[df['color'] == 'yellow'].sample(int(yellow_size * sample_size))
    others_size = 1 - yellow_size
    df_others = df[df['color'] != 'yellow'].sample(int(others_size * sample_size))
    df_sample = pd.concat([df_yellow, df_others]).sample(frac=1)
I would use this package to oversample your yellows into a new sample that has the balance you want:
https://imbalanced-learn.readthedocs.io/en/stable/over_sampling.html
From there just randomly select items and check sum until you have the set you want.
Something less time-complex would be to binary search over a range the length of your data frame, using the binary search value as your sample size, until you get the cumsum you want. This assumes the feature is symmetrically distributed.
I think this example will help you. I add a df2['yellow_rate'] column and calculate the rate; you only need to check the df2.iloc[df2.shape[0] - 1]['yellow_rate'] value.
df1 = pd.DataFrame({'id': ['A', 'B', 'C', 'D', 'E', 'G', 'H', 'I', 'J'],
                    'color': ['red', 'blue', 'green', 'yellow', 'yellow',
                              'yellow', 'orange', 'yellow', 'yellow'],
                    'qty': [5, 2, 3, 4, 7, 6, 8, 1, 5]})
df2 = df1.sample(n=df1.shape[0])  # shuffle all rows
df2['yellow_rate'] = df2[df2.qty.cumsum() <= 18]['color'].apply(lambda x: 1 if x == 'yellow' else 0)
# note: DataFrame.append was removed in pandas 2.0; use pd.concat in newer versions
df2 = df2.dropna().append(df2.sum(numeric_only=True) / df2.count(numeric_only=True), ignore_index=True)
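None of the answers above enforces both constraints in one place; for reference, a rejection-sampling sketch (the function and its parameter names are my own):
def sample_with_conditions(df, max_qty=18, min_yellow=0.65, tries=1000):
    # shuffle, cut at the quantity budget, retry until the yellow share is high enough
    for _ in range(tries):
        shuffled = df.sample(frac=1)
        picked = shuffled[shuffled['qty'].cumsum() <= max_qty]
        if len(picked) and picked['color'].eq('yellow').mean() >= min_yellow:
            return picked
    raise ValueError('no sample satisfying both conditions found')

print(sample_with_conditions(df1))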
