Selecting values from dataframe based on multiple column values - python

I have a dataframe in this format:
  ageClass sex nationality  treatment                                 unique_id  netTime  clockTime
0       20   M         KEN  Treatment  354658649da56c20c72b6689d2b7e1b8cc334ac9     7661       7661
1       20   M         KEN  Treatment  1da607e762ac07eba6f9b5a717e9ff196d987242     7737       7737
2       20   M         KEN    Control  1de4a95cef28c290ba5790217288f510afc3b26b     7747       7747
3       30   M         KEN    Control  12215d93d2cb5b0234991a64d097955338a73dd3     7750       7750
4       30   M         KEN  Treatment  5375986567be20b49067956e989884908fb807f6     8163       8163
5       20   M         ETH  Treatment  613be609b3f4a38834c2bc35bffbdb6c47418666     7811       7811
6       20   M         KEN    Control  70fb3284d112dc27a5cad7f705b38bc91f56ecad     7853       7853
7       30   M         AUS    Control  0ea5d606a83eb68c89da0a98543b815e383835e3     7902       7902
8       20   M         BRA    Control  ecdd57df778ad901b41e79dd2713a23cb8f13860     7923       7923
9       20   M         ETH    Control  ae7fe893268d86b7a1bdb4489f9a0798797c718c     7927       7927
The objective is to determine which age class benefited most from being in the treatment group, as measured by clockTime.
That means I need to somehow group all members of each age class under both the treatment and control conditions and take the average of their clockTimes.
Then, following that, I need to take the difference of the average clockTimes for the subgroups and compare all of these differences against one another.
Where I am stuck is filtering the dataframe on multiple columns simultaneously. I tried using groupby() as follows:
df.groupby(['ageClass','treatment'])['clockTime'].mean()
However, I was not able to then calculate the difference in the mean times from the resulting series.
How should I move forward?

You can pivot the table with the means you produced:
df2 = (df.groupby(['ageClass', 'treatment'])[['clockTime']].mean()
         .reset_index()
         .pivot(index='treatment', columns='ageClass', values='clockTime'))
ageClass            20      30
treatment
Control    7862.500000  7826.0
Treatment  7736.333333  8163.0
Then it's easy to find a difference
df2['diff'] = df2[20] - df2[30]
treatment
Control 36.500000
Treatment -426.666667
Name: diff, dtype: float64
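As a side note, the same pivoted frame can be built a bit more directly by unstacking the grouped series instead of resetting the index and pivoting (a sketch, using the same df):
s = df.groupby(['ageClass', 'treatment'])['clockTime'].mean()
df2 = s.unstack('ageClass')  # 'treatment' stays as the index, 'ageClass' becomes the columns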

From the groupby you've already done, you can group by index level 0, i.e. 'ageClass', and then use diff() to find the difference between the averages of the treatment and control groups for each 'ageClass'. Since diff() subtracts the first value from the second (and "Control" sorts before "Treatment" alphabetically), the result is Treatment minus Control; append "-Control" to the label to make that clearer.
s = df.groupby(['ageClass','treatment'])['clockTime'].mean()
out = s.groupby(level=0).diff().dropna().reset_index()
out = out.assign(treatment=out['treatment']+'-Control')
Output:
ageClass treatment clockTime
0 20 Treatment-Control -126.166667
1 30 Treatment-Control 337.000000
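Since a lower clockTime is better, the age class that benefited most from treatment is the one with the most negative difference. Assuming the out frame above, one way to pick it out:
best = out.loc[out['clockTime'].idxmin(), 'ageClass']  # 20 for this sample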

From your problem description, I would prescribe ranking. Differences between groups won't tell you who benefited the most.
s = df.groupby(['ageClass', 'treatment'])['clockTime'].agg('mean').reset_index()
s['rank'] = s.groupby('ageClass')['clockTime'].rank()
ageClass treatment clockTime rank
0 20 Control 7862.500000 2.0
1 20 Treatment 7736.333333 1.0
2 30 Control 7826.000000 1.0
3 30 Treatment 8163.000000 2.0

Related

Print summary of frequency of one column that requires two other columns in order to figure out?

I had other issues that I resolved but this problem here has set me back a little bit.
I have the following columns (there are 50,000 rows in total in my actual file):
Area    Date       SpeedOver  Risk  Accident
Wendly  8/8/2010   15         L     No
Wendly  2/9/2010   35         L     Yes
Reet    1/5/2010   65         M     Yes
Reet    9/11/2010  10         M     Yes
Sarall  14/3/2010  18         M     No
Sarall  7/6/2010   23         H     No
Sarall  23/6/2014  25         H     Yes
I am trying to print the top 3 locations based on accidents in the year 2010. So the output should be:
Reet
Wendly
Sarall
top_loc_accident = df[(df.index.year==2010)]['Accident'].nlargest(n=3)
print(top_loc_accident)
But the above code prints the date itself and the accidents, not the actual location name, so I have it 50% correct but it's a bit confusing currently.
You first need to aggregate the number of accidents:
# select rows of 2010
# the original method can be used here
m1 = df['Date'].str.endswith('2010')
# m1 = df.index.year==2010
# identify rows with accidents
m2 = df['Accident'].eq('Yes')
# count the accidents of 2010
# keep the top 3
m2[m1].groupby(df['Area']).sum().nlargest(3)
Output:
Area
Reet 2
Wendly 1
Sarall 0
Name: Accident, dtype: int64
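If you only want the three area names, as in your expected output, take the index of that result (reusing m1 and m2 from above):
top3 = m2[m1].groupby(df['Area']).sum().nlargest(3).index.tolist()
print(top3)  # ['Reet', 'Wendly', 'Sarall']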

Creating a new column based on entries from another column in Python

I'm new to Python and hope you guys can help me with the following:
I have a data frame that contains the daily demand for a certain product. However, the demand is shown cumulatively over time. I want to create a column that shows the actual daily demand (see table below).
Current data frame:
Day#  Cumulative Demand
1     10
2     15
3     38
4     44
5     53
What I want to achieve:
Day#  Cumulative Demand  Daily Demand
1     10                 10
2     15                 5
3     38                 23
4     44                 6
5     53                 9
Thank you!
Firstly, we need the data from the old column:
# My Dataframe is called df
demand = df["Cumulative Demand"].tolist()
Then recalculate the data:
daily_demand = [demand[0]]  # day 1's daily demand equals the first cumulative value
for i, d in enumerate(demand[1:]):
    daily_demand.append(d - demand[i])  # demand[i] is the previous day's cumulative value
Lastly, append the data as a new column:
df["Daily Demand"] = daily_demand
Assuming what you shared above is representative of your actual data, meaning you have one row per day and the Day# column is sorted in ascending order:
You can use shift() (please read what it does) and subtract the shifted version of the cumulative demand from the cumulative demand itself. This gives you back the actual daily demand.
To make sure that it works, check whether the cumulative sum of the new daily demand column reproduces the Cumulative Demand column, using cumsum().
import pandas as pd

# Calculate your Daily Demand column
df['Daily Demand'] = (df['Cumulative Demand'] - df['Cumulative Demand'].shift()).fillna(df['Cumulative Demand'].iloc[0])

# Check whether the cumulative sum of daily demands sums back to the Cumulative Demand
>>> all(df['Daily Demand'].cumsum() == df['Cumulative Demand'])
True
Will print back:
Day Cumulative Demand Daily Demand
0 1 10 10.0
1 2 15 5.0
2 3 38 23.0
3 4 44 6.0
4 5 53 9.0
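As a side note, pandas has a built-in diff() that does the shift-and-subtract in one step, so an equivalent one-liner (a sketch under the same assumptions) is:
df['Daily Demand'] = df['Cumulative Demand'].diff().fillna(df['Cumulative Demand'].iloc[0])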

How to normalize varying data in pandas?

Consider the following dataframe. I have the columns 'product', 'buys', 'buy_again' and 'again_index'.
PS: again_index is buy_again/buys.
product  buys  buy_again  again_index
a        3     2          0.667
b        40    10         0.25
c        2     1          0.5
d        420   70         0.166
e        87    21         0.241
f        28    4          0.142
Now, the numbers for buys and buy_again are very skewed, and it is unfair to compare product d to product a based on this index. I want to normalize the data using pandas in such a way that the index can be used to directly compare one product to another, irrespective of the product being new (e.g. one with just 3 buys) or old (e.g. one with 420 buys), so that the index can be my deciding factor for a product's performance.
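One common way to make such ratios comparable across very different sample sizes is additive (Laplace) smoothing, which pulls small-sample indices toward the overall buy-again rate. A minimal sketch, where the prior weight k = 20 is an arbitrary choice rather than anything from the question:
overall_rate = df['buy_again'].sum() / df['buys'].sum()
k = 20  # prior weight: pseudo-buys at the overall rate added to every product
df['smoothed_index'] = (df['buy_again'] + k * overall_rate) / (df['buys'] + k)
A product with only a handful of buys then sits close to overall_rate until it accumulates enough data to pull its smoothed index away from it.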

Boxplot: Extract outliers and tag them as either '0' or '1'

I'm trying to extract outliers from my dataset and tag them accordingly.
Sample Data
Doctor Name Hospital Assigned Region Claims Illness Claimed
1 Albert Some hospital Center R-1 20 Sepsis
2 Simon Another hospital Center R-2 21 Pneumonia
3 Alvin ... ... ... ...
4 Robert
5 Benedict
6 Cruz
So I'm trying to group every Doctor that Claimed a certain Illness in a certain Region and trying to find outliers among them.
Doctor Name Hospital Assigned Region Claims Illness Claimed is_outlier
1 Albert Some hospital Center R-1 20 Sepsis 1
2 Simon Another hospital Center R-2 21 Pneumonia 0
3 Alvin ... ... ... ...
4 Robert
5 Benedict
6 Cruz
I can do this in Power BI. But being fairly new to Python, I can't seem to figure this out.
This is what I'm trying to achieve:
Algo goes like:
Read data
Group data by Illness
Group by Region
Get the IQR based on the Claims count
if a claims count > Q3 + 1.5 * IQR
    then tag it as outlier = 1
else
    not an outlier = 0
Export data
Any ideas?
Assuming you use pandas for data analysis (and you should!), you can use the dataframe boxplot method to produce a plot similar to yours:
import pandas as pd
import numpy as np

df.boxplot(column=['b'], whis=[10, 90], vert=False,
           flierprops=dict(markerfacecolor='g', marker='D'))
Or, if you want to mark them 0/1 as you requested, use the dataframe quantile() method (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html):
df.assign(outlier=df[df>=df.quantile(.9)].any(axis=1)).astype(np.int8)
a b outlier
0 1 1 0
1 2 10 0
2 3 100 1
3 4 100 1
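The quantile(.9) approach above uses a single global cutoff; if you want the exact Q3 + 1.5 * IQR rule from your pseudocode, computed per (Illness, Region) group, a minimal sketch (assuming the columns are named 'Illness Claimed', 'Region' and 'Claims') would be:
grouped = df.groupby(['Illness Claimed', 'Region'])['Claims']
q1 = grouped.transform(lambda s: s.quantile(0.25))  # first quartile within each group
q3 = grouped.transform(lambda s: s.quantile(0.75))  # third quartile within each group
iqr = q3 - q1
df['is_outlier'] = (df['Claims'] > q3 + 1.5 * iqr).astype(int)  # 1 = outlier, 0 = not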

Divide 2 columns and create new column with results

I have a data frame with columns:
User_id PQ_played PQ_offered
1 5 15
2 12 75
3 25 50
I need to divide PQ_played by PQ_offered to calculate the % of games played. This is what I've tried so far:
new_df['%_PQ_played'] = df.groupby('User_id').((df['PQ_played']/df['PQ_offered'])*100),as_index=True
I know that I am terribly wrong.
It's much simpler than you think.
df['%_PQ_played'] = df['PQ_played'] / df['PQ_offered'] * 100
PQ_offered PQ_played %_PQ_played
User_id
1 15 5 33.333333
2 75 12 16.000000
3 50 25 50.000000
You can use a lambda function:
df.groupby('User_id').apply(lambda x: (x['PQ_played'] / x['PQ_offered']) * 100)\
    .reset_index(1, drop=True).reset_index().rename(columns={0: '%_PQ_played'})
You get
User_id %_PQ_played
0 1 33.333333
1 2 16.000000
2 3 50.000000
I totally agree with @mVChr and think you are overcomplicating what you need to do. If you are simply trying to add an additional column, then his response is spot on. If you truly need to groupby, it is worth noting that this is typically used for aggregation, e.g. sum(), count(), etc. If, for example, you had several records with non-unique values in the User_id column, then you could create the additional column using
df['%_PQ_played'] = df['PQ_played'] / df['PQ_offered'] * 100
and then perform an aggregation. Let's say you wanted to know, for each user, the average percentage of offered games that were played; you could do something like
new_df = df.groupby('User_id', as_index=False)['%_PQ_played'].mean()
This would yield (numbers are arbitrary)
User_id %_PQ_played
0 1 52.777778
1 2 29.250000
2 3 65.000000
