Counting the number of entries in a dataframe that satisfy multiple criteria - python

I have a dataframe with 9 columns, two of which are gender and smoker status. Every row in the dataframe is a person, and each column is their entry on a particular trait.
I want to count the number of entries that satisfy the condition of being both a smoker and male.
I have tried using a sum function:
maleSmoke = sum(1 for i in data['gender'] if i is 'm' and i in data['smoker'] if i is 1 )
but this always returns 0. This method works when I check only one criterion, however, and I can't figure out how to expand it to a second.
I also tried writing a function that counted its way through every entry in the dataframe, but this also returns 0 for all entries.
def countSmokeGender(df):
    maleSmoke = 0
    femaleSmoke = 0
    maleNoSmoke = 0
    femaleNoSmoke = 0
    for i in range(20000):
        if df['gender'][i] is 'm' and df['smoker'][i] is 1:
            maleSmoke = maleSmoke + 1
        if df['gender'][i] is 'f' and df['smoker'][i] is 1:
            femaleSmoke = femaleSmoke + 1
        if df['gender'][i] is 'm' and df['smoker'][i] is 0:
            maleNoSmoke = maleNoSmoke + 1
        if df['gender'][i] is 'f' and df['smoker'][i] is 0:
            femaleNoSmoke = femaleNoSmoke + 1
    return maleSmoke, femaleSmoke, maleNoSmoke, femaleNoSmoke
I've tried pulling out the data sets as numpy arrays and counting those, but that wasn't working either.

Are you using pandas?
Assuming you are, you can simply do this:
# How many male smokers
len(df[(df['gender']=='m') & (df['smoker']==1)])
# How many female smokers
len(df[(df['gender']=='f') & (df['smoker']==1)])
# How many male non-smokers
len(df[(df['gender']=='m') & (df['smoker']==0)])
# How many female non-smokers
len(df[(df['gender']=='f') & (df['smoker']==0)])
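By the way, the reason your own attempts always return 0 is most likely that 'is' tests object identity, not equality; comparisons like df['gender'][i] == 'm' and df['smoker'][i] == 1 are what you want.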
Or, since smoker is 0/1, you can count the smokers per gender with groupby:
df.groupby(['gender'])['smoker'].sum()
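If you want all four counts at once, grouping by both columns should also work (a small sketch extending the line above):
df.groupby(['gender', 'smoker']).size()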

Another alternative, which is great for data exploration: .pivot_table
With a DataFrame like this
id gender smoker other_trait
0 0 m 0 0
1 1 f 1 1
2 2 m 1 0
3 3 m 1 1
4 4 f 1 0
.. .. ... ... ...
95 95 f 0 0
96 96 f 1 1
97 97 f 0 1
98 98 m 0 0
99 99 f 1 0
you could do
result = df.pivot_table(
    index="smoker", columns="gender", values="id", aggfunc="count"
)
to get a result like
gender f m
smoker
0 32 16
1 27 25
If you also want the row and column totals you can add the margins=True option and get
gender f m All
smoker
0 32 16 48
1 27 25 52
All 59 41 100
If you don't have a column to count over (you can't use smoker or gender because they are used for the labels), you can add a dummy column:
result = df.assign(dummy=1).pivot_table(
    index="smoker", columns="gender", values="dummy", aggfunc="count",
    margins=True
)
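If you would rather avoid the dummy column entirely, pd.crosstab is a possible alternative that counts co-occurrences directly; a sketch along the same lines:
result = pd.crosstab(df["smoker"], df["gender"], margins=True)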

Related

How to group various groups in python into one

I have a dataset which I want to group by the age.
So, here is the first part of the dataset:
It is a simulation of inventory data. Größe ("size") is the number of people with a given age (Alter), Risiko ("risk") gives every person a number, and Geschlecht ("gender") is female or male.
I want to add a column "Group" and give every person aged 15-19 one number, those aged 20-24 the next number, and so on.
How can I do this?
You can use map and lambda to create a new column like so:
def return_age_from_range(age):
    # Max value in range is excluded, so remember to add +1 to the range you want
    if age in range(15, 20):
        return 1
    elif age in range(20, 25):
        return 2
    # and so on...

df['group'] = df.Alter.map(lambda x: return_age_from_range(x))
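If you have many bands, pd.cut may be a cleaner alternative. A sketch, assuming 5-year bands starting at 15 and the column name Alter from the sample data:
import pandas as pd

# Edges 15, 20, ..., 70 give right-open bands [15, 20), [20, 25), ...
bins = list(range(15, 75, 5))
df['group'] = pd.cut(df['Alter'], bins=bins, right=False, labels=list(range(1, len(bins))))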
Use numpy.select:
In [488]: import numpy as np
In [489]: conds = [df['Alter'].between(15,19), df['Alter'].between(20,24), df['Alter'].between(25,29)]
In [490]: choices = [1,2,3]
In [493]: df['Group'] = np.select(conds, choices)
In [494]: df
Out[494]:
Größe Risiko Geschlecht Alter Group
0 95 1 F 15 1
1 95 2 F 15 1
2 95 3 M 15 1
3 95 4 F 15 1
4 95 5 M 15 1
5 95 6 M 15 1
6 95 7 M 15 1
7 95 8 F 15 1
8 95 9 M 15 1
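Note that any age falling outside all the conditions gets np.select's default value (0, unless you pass something like default=-1), which makes out-of-range values easy to spot.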

Generate Column Value in Pandas based on previous rows

Let us assume I am taking a temperature measurement on a regular interval and recording the values in a Pandas Dataframe
day temperature [F]
0 89
1 91
2 93
3 88
4 90
Now I want to create another column which is set to 1 if and only if a value and the one before it are both above a certain level. In my scenario I want a column value of 1 wherever two consecutive values are above 90, thus yielding
day temperature Above limit?
0 89 0
1 91 0
2 93 1
3 88 0
4 91 0
5 91 1
6 93 1
Despite some SO and Google digging, it's not clear if I can use iloc[x], loc[x] or something else in a for loop?
You are looking for the shift function in pandas.
import io
import pandas as pd
data = """
day temperature Expected
0 89 0
1 91 0
2 93 1
3 88 0
4 91 0
5 91 1
6 93 1
"""
data = io.StringIO(data)
df = pd.read_csv(data, sep='\s+')
df['Result'] = ((df['temperature'].shift(1) > 90) & (df['temperature'] > 90)).astype(int)
# Validation
(df['Result'] == df['Expected']).all()
Try this:
df = pd.DataFrame({'temperature': [89, 91, 93, 88, 90, 91, 91, 93]})
limit = 90
df['Above'] = ((df['temperature']>limit) & (df['temperature'].shift(1)>limit)).astype(int)
df
In the future, please include the code needed for testing (in this case, the df construction line).
df['limit']=""
df.iloc[0,2]=0
for i in range (1,len(df)):
if df.iloc[i,1]>90 and df.iloc[i-1,1]>90:
df.iloc[i,2]=1
else:
df.iloc[i,2]=0
Here iloc[i, 2] refers to row index i and column index 2 (the limit column). Hope this helps.
Solution using shift():
>> threshold = 90
>> df['Above limit?'] = 0
>> df.loc[((df['temperature [F]'] > threshold) & (df['temperature [F]'].shift(1) > threshold)), 'Above limit?'] = 1
>> df
day temperature [F] Above limit?
0 0 89 0
1 1 91 0
2 2 93 1
3 3 88 0
4 4 90 0
Try using rolling(window=2) and then apply() as follows (raw=True passes the window as a plain array, so the positional indexing works):
df["limit"] = df['temperature'].rolling(2).apply(lambda x: int(x[0] > 90) & int(x[-1] > 90), raw=True)

how to add complementary intervals in pandas dataframe

Let's say that I have a signal of 100 samples, L=100.
In this signal I found some intervals that I label as "OK". The intervals are stored in a Pandas DataFrame that looks like this:
c = pd.DataFrame(np.array([[10,26],[50,84]]),columns=['Start','End'])
c['Value']='OK'
How can I add the complementary intervals to get another dataframe that looks like this:
d = pd.DataFrame(np.array([[0,9],[10,26],[27,49],[50,84],[85,100]]),columns=['Start','End'])
d['Value']=['Check','OK','Check','OK','Check']
You can use the first DataFrame to create the second one and merge as suggested by @jezrael:
d = pd.DataFrame({"Start": [0] + sorted(pd.concat([c.Start, c.End + 1])),
                  "End": sorted(pd.concat([c.Start - 1, c.End])) + [100]})
d = pd.merge(d, c, how='left')
d['Value'] = d['Value'].fillna('Check')
d = d.reindex(columns=["Start", "End", "Value"])  # reindex_axis was removed in newer pandas
output
Start End Value
0 0 9 Check
1 10 26 OK
2 27 49 Check
3 50 84 OK
4 85 100 Check
I think you need:
d = pd.merge(d, c, how='left')
d['Value'] = d['Value'].fillna('Check')
print (d)
Start End Value
0 0 9 Check
1 10 26 OK
2 27 49 Check
3 50 84 OK
4 85 100 Check
EDIT:
You can use numpy.concatenate with numpy.sort and numpy.column_stack, plus the DataFrame constructor, to build the new df. Then merge with c and use fillna with a dict to fill in the Value column:
s = np.sort(np.concatenate([[0], c['Start'].values, c['End'].values + 1]))
e = np.sort(np.concatenate([c['Start'].values - 1, c['End'].values, [100]]))
d = pd.DataFrame(np.column_stack([s,e]), columns=['Start','End'])
d = pd.merge(d, c, how='left').fillna({'Value':'Check'})
print (d)
Start End Value
0 0 9 Check
1 10 26 OK
2 27 49 Check
3 50 84 OK
4 85 100 Check
EDIT1:
New boundary values are added to c by loc, it is reshaped to a Series by stack and shifted, and the df is then rebuilt by unstack:
b = c.copy()
max_val = 100
min_val = 0
c.loc[-1, 'Start'] = max_val + 1
a = c[['Start','End']].stack(dropna=False).shift().fillna(min_val - 1).astype(int).unstack()
a['Start'] = a['Start'] + 1
a['End'] = a['End'] - 1
a['Value'] = 'Check'
print (a)
Start End Value
0 0 9 Check
1 27 49 Check
-1 85 100 Check
d = pd.concat([b, a]).sort_values('Start').reset_index(drop=True)
print (d)
Start End Value
0 0 9 Check
1 10 26 OK
2 27 49 Check
3 50 84 OK
4 85 100 Check
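For completeness, a shorter sketch of my own, assuming the 'OK' intervals are sorted, non-overlapping, and don't touch the ends of the signal:
gaps = pd.DataFrame({'Start': [0] + (c['End'] + 1).tolist(),
                     'End': (c['Start'] - 1).tolist() + [100]})
gaps['Value'] = 'Check'
d = pd.concat([c, gaps]).sort_values('Start').reset_index(drop=True)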

Retain feature names after Scikit Feature Selection

Running scikit-learn's VarianceThreshold on a set of data removes a couple of features. I feel I'm doing something simple yet stupid, but I'd like to retain the names of the remaining features. The following code:
def VarianceThreshold_selector(data):
    selector = VarianceThreshold(.5)
    selector.fit(data)
    selector = pd.DataFrame(selector.transform(data))
    return selector
x = VarianceThreshold_selector(data)
print(x)
changes the following data (this is just a small subset of the rows):
Survived Pclass Sex Age SibSp Parch Nonsense
0 3 1 22 1 0 0
1 1 2 38 1 0 0
1 3 2 26 0 0 0
into this (again just a small subset of the rows)
0 1 2 3
0 3 22.0 1 0
1 1 38.0 1 0
2 3 26.0 0 0
Using the get_support method, I know that these are Pclass, Age, Sibsp, and Parch, so I'd rather this return something more like :
Pclass Age Sibsp Parch
0 3 22.0 1 0
1 1 38.0 1 0
2 3 26.0 0 0
Is there an easy way to do this? I'm very new with Scikit Learn, so I'm probably just doing something silly.
Would something like this help? If you pass it a pandas dataframe, it will get the columns and use get_support, as you mentioned, to pull out only the column headers that met the variance threshold.
>>> df
Survived Pclass Sex Age SibSp Parch Nonsense
0 0 3 1 22 1 0 0
1 1 1 2 38 1 0 0
2 1 3 2 26 0 0 0
>>> from sklearn.feature_selection import VarianceThreshold
>>> def variance_threshold_selector(data, threshold=0.5):
selector = VarianceThreshold(threshold)
selector.fit(data)
return data[data.columns[selector.get_support(indices=True)]]
>>> variance_threshold_selector(df, 0.5)
Pclass Age
0 3 22
1 1 38
2 3 26
>>> variance_threshold_selector(df, 0.9)
Age
0 22
1 38
2 26
>>> variance_threshold_selector(df, 0.1)
Survived Pclass Sex Age SibSp
0 0 3 1 22 1
1 1 1 2 38 1
2 1 3 2 26 0
I came here looking for a way to get transform() or fit_transform() to return a data frame, but I suspect it's not supported.
However, you can subset the data a bit more cleanly like this:
data_transformed = data.loc[:, selector.get_support()]
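If you do want a DataFrame back from transform(), you can rebuild one yourself; a sketch reusing the fitted selector:
cols = data.columns[selector.get_support()]
data_transformed = pd.DataFrame(selector.transform(data), columns=cols, index=data.index)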
There are probably better ways to do this, but for those interested, here's how I did it:
def VarianceThreshold_selector(data):
    # Select model
    selector = VarianceThreshold(0)  # Defaults to 0.0, i.e. only remove features with the same value in all samples
    # Fit the model
    selector.fit(data)
    features = selector.get_support(indices=True)  # returns an array of integers corresponding to non-removed features
    features = list(data.columns[features])  # names of all non-removed features
    # Format and return
    selector = pd.DataFrame(selector.transform(data))
    selector.columns = features
    return selector
As I had some problems with the function by Jarad, I combined it with the solution by pteehan, which I found more reliable. I also added NA replacement as standard, as VarianceThreshold does not like NA values.
def variance_threshold_select(df, thresh=0.0, na_replacement=-999):
    df1 = df.copy(deep=True)  # Make a deep copy of the dataframe
    selector = VarianceThreshold(thresh)
    selector.fit(df1.fillna(na_replacement))  # Fill NA values, as VarianceThreshold cannot deal with those
    df2 = df.loc[:, selector.get_support(indices=False)]  # Keep only the columns that passed the threshold
    return df2
How about this as code?
import statistics

low_var_cols = []
for col in df.columns:
    if statistics.variance(df[col]) <= 0.1:
        low_var_cols.append(col)
then drop those columns from the dataframe?
You can use Pandas for thresholding too
data_new = data.loc[:, data.std(axis=0) > 0.75]
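One caveat: VarianceThreshold compares variances, not standard deviations, and it uses the population variance (ddof=0), so the closest pandas equivalent of VarianceThreshold(0.75) should be:
data_new = data.loc[:, data.var(axis=0, ddof=0) > 0.75]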

Groupby/Sum in Python Pandas - zero counts not showing ...sometimes

The Background
I have a data set of a simulated population of people. They have the following attributes
Age (0-120 years)
Gender (male,female)
Race (white, black, hispanic, asian, other)
df.head()
Age Race Gender in_population
0 32 0 0 1
1 53 0 0 1
2 49 0 1 1
3 12 0 0 1
4 28 0 0 1
There is another variable that identifies the individual as "In_Population"*, which is a boolean. I am using groupby in pandas to group the population by the possible combinations of the 3 attributes, calculating a table of counts by summing the "In_Population" variable within each possible category of person.
There are 2 genders * 5 races * 121 ages = 1210 total possible groups that every individual in the population will fall under.
If a particular group of people in a particular year has no members (e.g. 0 year old male 'other'), then I still want that group to show up in my group-by dataframe, but with a zero in the count. This happens correctly in the data sample below (Age = 0, Gender = {0,1}, and Race = 4). There were no 'other' zero year olds in this particular population.
grouped_obj = df.groupby( ['Age','Gender','Race'] )
groupedAGR = grouped_obj.sum()
groupedAGR.head(10)
in_population
Age Gender Race
0 0 0 16
1 8
2 63
3 5
4 0
1 0 22
1 4
2 64
3 12
4 0
The issue
This only happens for some of the Age-Gender-Race combinations.
Sometimes the zero-sum groups get skipped entirely. The following is the data for age 45. I was expecting to see a 0 for Race 4 under Gender 0, indicating that there are no 45 year old male 'other' races in this data set.
>>> groupedAGR.xs( 45, level = 'Age' )
in_population
Gender Race
0 0 515
1 68
2 40
3 20
1 0 522
1 83
2 48
3 29
4 3
Notes
*"In_Population"
This basically filters out "newborns" and "immigrants" who are not part of the relevant population when calculating mortality rates; the deaths in the population happen before immigration and births do, so I exclude them from the calculations. I had a suspicion that this had something to do with it - the zero year olds were showing zero counts while other age groups were not showing anything at all... but that's not the case.
>>> groupedAGR.xs( 88, level = 'Age' )
in_population
Gender Race
0 0 52
2 1
3 0
1 0 62
1 3
2 5
3 3
4 1
There are no 88 year old Asian men in the population, so there's a zero in the category. There are no 88 year old 'other' men in the population either, but they don't show up at all.
EDIT: I added in the code showing how I'm making the group by object in pandas and how I'm summing to find the counts in each group.
Use reindex with a predefined index and fill_value=0
import numpy as np
import pandas as pd

ages = np.arange(21, 26)
genders = ['male', 'female']
races = ['white', 'black', 'hispanic', 'asian', 'other']
sim_size = 10000

midx = pd.MultiIndex.from_product([
    ages,
    genders,
    races
], names=['Age', 'Gender', 'Race'])

sim_df = pd.DataFrame({
    # [1:-1] is used to explicitly skip some age groups
    'Age': np.random.choice(ages[1:-1], sim_size),
    'Gender': np.random.choice(genders, sim_size),
    'Race': np.random.choice(races, sim_size)
})
These will have missing age groups
counts = sim_df.groupby(sim_df.columns.tolist()).size()
counts.unstack()
This fills in missing age groups
counts.reindex(midx, fill_value=0).unstack()
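Applied to the question's data, the same idea would look something like this (a sketch, assuming Age runs 0-120 and Gender/Race are coded 0-1 and 0-4 as in the question's tables):
full_idx = pd.MultiIndex.from_product([range(121), range(2), range(5)],
                                      names=['Age', 'Gender', 'Race'])
groupedAGR = groupedAGR.reindex(full_idx, fill_value=0)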
