I have a large dataset that looks like:
Shop        Date        Hour Ending  Hours Operating  Produced
Cornerstop  01-01-2010  0            1                9
Cornerstop  01-01-2010  1            1                11
Cornerstop  01-01-2010  2            1                10
...
Cornerstop  01-01-2010  23           1                0
Leaf Grove  01-01-2010  0            1                7
Leaf Grove  01-01-2010  1            1                4
Leaf Grove  01-01-2010  2            1                2
I want to find the top 20 shops by total production. I've used data.describe() to check the top percentiles, but that doesn't help: if I threshold on the top percentile of 'Produced', some days drop out of the data.
This is a newbie question, but how can I easily pick out these top shops based on that criterion? Perhaps use the percentile just to identify the top shops and then slice them out of the dataset? It feels like there's a much better way to do this.
Use sort_values() and head():
df.sort_values('Produced', ascending=False).head(20)
If you want to sum the production values for each shop and then sort, you can do:
df.groupby('Shop').agg({'Produced': 'sum'}).sort_values('Produced', ascending=False).head(20)
Use .nlargest
df.groupby('Shop').Produced.sum().nlargest(20)
Add .index.tolist() if you just need a list of Shops.
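If you also want to keep every hourly row for those shops (so no days are lost), one way, reusing the column names above, is to filter the original frame on that index:

top_shops = df.groupby('Shop')['Produced'].sum().nlargest(20).index
df[df['Shop'].isin(top_shops)]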
What about the following to sort the column and then take the top 20?
df = df.sort_values('Produced', ascending=False)
df.head(20)
I have a dataframe in this format:
   ageClass sex nationality  treatment                                 unique_id  netTime  clockTime
0        20   M         KEN  Treatment  354658649da56c20c72b6689d2b7e1b8cc334ac9     7661       7661
1        20   M         KEN  Treatment  1da607e762ac07eba6f9b5a717e9ff196d987242     7737       7737
2        20   M         KEN    Control  1de4a95cef28c290ba5790217288f510afc3b26b     7747       7747
3        30   M         KEN    Control  12215d93d2cb5b0234991a64d097955338a73dd3     7750       7750
4        30   M         KEN  Treatment  5375986567be20b49067956e989884908fb807f6     8163       8163
5        20   M         ETH  Treatment  613be609b3f4a38834c2bc35bffbdb6c47418666     7811       7811
6        20   M         KEN    Control  70fb3284d112dc27a5cad7f705b38bc91f56ecad     7853       7853
7        30   M         AUS    Control  0ea5d606a83eb68c89da0a98543b815e383835e3     7902       7902
8        20   M         BRA    Control  ecdd57df778ad901b41e79dd2713a23cb8f13860     7923       7923
9        20   M         ETH    Control  ae7fe893268d86b7a1bdb4489f9a0798797c718c     7927       7927
The objective is to determine which age class benefited most from being in the treatment group, as measured by clockTime.
That means I need to somehow group all values for the members of each age group, for both the treatment and control conditions, and take the average of their clockTimes.
Then, following that, I need to take the difference of the average clockTimes for the subgroups and compare all of these against one another.
Where I am stuck is filtering the dataframe on multiple columns simultaneously. I tried using groupby() as follows:
df.groupby(['ageClass','treatment'])['clockTime'].mean()
However, I was not able to calculate the difference in the mean times from the resulting Series.
How should I move forward?
You can pivot the table of means you produced:
df2 = (df.groupby(['ageClass', 'treatment'])[['clockTime']].mean()
         .reset_index()
         .pivot(index='treatment', columns='ageClass', values='clockTime'))
ageClass 20 30
treatment
Control 7862.500000 7826.0
Treatment 7736.333333 8163.0
Then it's easy to find the difference between the two age classes:
df2['diff'] = df2[20] - df2[30]
treatment
Control 36.500000
Treatment -426.666667
Name: diff, dtype: float64
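If what you ultimately want is the treatment effect within each age class, the same pivot also gives that directly by subtracting rows instead of columns:

df2.loc['Treatment'] - df2.loc['Control']

ageClass
20   -126.166667
30    337.000000
dtype: float64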
From the groupby you've already done, you can group by index level 0, i.e. 'ageClass', and then use diff to find the difference between the averages of the treatment and control groups for each 'ageClass'. Since diff subtracts the first value from the second (and "Control" sorts before "Treatment" alphabetically), append "-Control" to the label to make it a bit clearer.
s = df.groupby(['ageClass','treatment'])['clockTime'].mean()
out = s.groupby(level=0).diff().dropna().reset_index()
out = out.assign(treatment=out['treatment']+'-Control')
Output:
ageClass treatment clockTime
0 20 Treatment-Control -126.166667
1 30 Treatment-Control 337.000000
From your problem description, I would prescribe ranking. Differences between groups won't tell you who benefited the most.
s = df.groupby(['ageClass', 'treatment'])['clockTime'].agg('mean').reset_index()
s['rank'] = s.groupby('ageClass')['clockTime'].rank()
ageClass treatment clockTime rank
0 20 Control 7862.500000 2.0
1 20 Treatment 7736.333333 1.0
2 30 Control 7826.000000 1.0
3 30 Treatment 8163.000000 2.0
By grouped data I mean the following: assume we have a data set that is grouped by a single feature, e.g. customer data grouped by the individual customer:
Customer | Purchase Nr | Item      | Paid Amount ($)
1        | 1           | TShirt    | 15
1        | 2           | Trousers  | 25
1        | 3           | Scarf     | 10
2        | 1           | Underwear | 5
2        | 2           | Dress     | 35
2        | 3           | Trousers  | 30
2        | 4           | TShirt    | 10
3        | 1           | TShirt    | 8
3        | 2           | Socks     | 5
4        | 1           | Shorts    | 13
I want to find clusters such that a customer's purchases all end up in one single cluster; in other words, a customer must not appear in two clusters.
I thought about grouping the data set by customer with a groupby, though it is difficult to express all the information from the columns for one customer in a single column. Further, the order of purchases is important to me, e.g. whether a T-shirt was bought first or second.
Is there any clustering algorithm that incorporates group information like this?
Thank you!
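One way to make that groupby idea concrete (purely a sketch, assuming scikit-learn is available and using the column names from the sample) is to pivot each customer's purchases into a single feature row, one column per purchase position so that purchase order is preserved, and then cluster those rows. Each customer is exactly one point and therefore lands in exactly one cluster:

import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({
    'Customer':        [1, 1, 1, 2, 2, 2, 2, 3, 3, 4],
    'Purchase Nr':     [1, 2, 3, 1, 2, 3, 4, 1, 2, 1],
    'Item':            ['TShirt', 'Trousers', 'Scarf', 'Underwear', 'Dress',
                        'Trousers', 'TShirt', 'TShirt', 'Socks', 'Shorts'],
    'Paid Amount ($)': [15, 25, 10, 5, 35, 30, 10, 8, 5, 13],
})

# One row per customer: the amount paid at each purchase position.
features = (df.pivot(index='Customer', columns='Purchase Nr',
                     values='Paid Amount ($)')
              .fillna(0))

# Cluster the per-customer rows; the number of clusters is arbitrary here.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(features)
print(dict(zip(features.index, labels)))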
I'm trying to extract outliers from my dataset and tag them accordingly.
Sample Data
Doctor Name Hospital Assigned Region Claims Illness Claimed
1 Albert Some hospital Center R-1 20 Sepsis
2 Simon Another hospital Center R-2 21 Pneumonia
3 Alvin ... ... ... ...
4 Robert
5 Benedict
6 Cruz
So I'm trying to group every Doctor that claimed a certain Illness in a certain Region and find the outliers among them.
Doctor Name Hospital Assigned Region Claims Illness Claimed is_outlier
1 Albert Some hospital Center R-1 20 Sepsis 1
2 Simon Another hospital Center R-2 21 Pneumonia 0
3 Alvin ... ... ... ...
4 Robert
5 Benedict
6 Cruz
I can do this in Power BI. But being fairly new to Python, I can't seem to figure this out.
This is what I'm trying to achieve:
Algo goes like:
Read data
Group data by Illness
Group by Region
get IQR based on Claims Count
if claims count > Q3 + 1.5 * IQR
then tag it as outlier = 1
else
not an outlier = 0
Export data
Any ideas?
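A rough pandas sketch of those steps (column names as in the sample; applying the IQR rule within each Illness/Region group is my reading of the intent, and the file names are made up):

import pandas as pd

df = pd.read_csv('claims.csv')  # hypothetical input file

def tag_outliers(group):
    # IQR rule within one (Illness Claimed, Region) group
    q1, q3 = group['Claims'].quantile([0.25, 0.75])
    iqr = q3 - q1
    return group.assign(is_outlier=(group['Claims'] > q3 + 1.5 * iqr).astype(int))

df = (df.groupby(['Illness Claimed', 'Region'], group_keys=False)
        .apply(tag_outliers))

df.to_csv('claims_with_outliers.csv', index=False)  # export the tagged data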
Assuming you use pandas for data analysis (and you should!), you can use the DataFrame boxplot method to produce a plot similar to yours:
import pandas as pd
import numpy as np
df.boxplot(column=['b'], whis=[10, 90], vert=False,
flierprops=dict(markerfacecolor='g', marker='D'))
Or, if you want to mark them 0/1 as you requested, use the DataFrame quantile() method (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html):
df.assign(outlier=df[df>=df.quantile(.9)].any(axis=1)).astype(np.int8)
a b outlier
0 1 1 0
1 2 10 0
2 3 100 1
3 4 100 1
I have a data frame with columns:
User_id PQ_played PQ_offered
1 5 15
2 12 75
3 25 50
I need to divide PQ_played by PQ_offered to calculate the % of games played. This is what I've tried so far:
new_df['%_PQ_played'] = df.groupby('User_id').((df['PQ_played']/df['PQ_offered'])*100),as_index=True
I know that I am terribly wrong.
It's much simpler than you think.
df['%_PQ_played'] = df['PQ_played'] / df['PQ_offered'] * 100
PQ_offered PQ_played %_PQ_played
User_id
1 15 5 33.333333
2 75 12 16.000000
3 50 25 50.000000
You can use lambda functions
df.groupby('User_id').apply(lambda x: (x['PQ_played'] / x['PQ_offered']) * 100)\
  .reset_index(1, drop=True).reset_index().rename(columns={0: '%_PQ_played'})
You get
User_id %_PQ_played
0 1 33.333333
1 2 16.000000
2 3 50.000000
I totally agree with #mVChr and think you are overcomplicating what you need to do. If you are simply trying to add an additional column then his response is spot on. If you truly need to groupby, it is worth noting that this is typically used for aggregation, e.g., sum(), count(), etc. If, for example, you had several records with non-unique values in the User_id column, then you could create the additional column using
df['%_PQ_played'] = df['PQ_played'] / df['PQ_offered'] * 100
and then perform an aggregation. Let's say you wanted to know the average number of games played of the games offered for each user, you could do something like
new_df = df.groupby('User_id', as_index=False)['%_PQ_played'].mean()
This would yield (numbers are arbitrary)
User_id %_PQ_played
0 1 52.777778
1 2 29.250000
2 3 65.000000
I would like to sort the following dataframe:
Region LSE North South
0 Cn 33.330367 9.178917
1 Develd -36.157025 -27.669988
2 Wetnds -38.480206 -46.089908
3 Oands -47.986764 -32.324991
4 Otherg 323.209834 28.486310
5 Soys 34.936147 4.072872
6 Wht 0.983977 -14.972555
I would like to sort it so the LSE column is reordered based on the list:
lst = ['Oands','Wetnds','Develd','Cn','Soys','Otherg','Wht']
Of course, the other columns will need to be reordered accordingly as well. Is there any way to do this in pandas?
The improved support for Categoricals in pandas version 0.15 allows you to do this easily:
df['LSE_cat'] = pd.Categorical(
df['LSE'],
categories=['Oands','Wetnds','Develd','Cn','Soys','Otherg','Wht'],
ordered=True
)
df.sort('LSE_cat')
Out[5]:
Region LSE North South LSE_cat
3 3 Oands -47.986764 -32.324991 Oands
2 2 Wetnds -38.480206 -46.089908 Wetnds
1 1 Develd -36.157025 -27.669988 Develd
0 0 Cn 33.330367 9.178917 Cn
5 5 Soys 34.936147 4.072872 Soys
4 4 Otherg 323.209834 28.486310 Otherg
6 6 Wht 0.983977 -14.972555 Wht
If this is only a temporary ordering then keeping the LSE column as a Categorical may not be what you want, but if this ordering is something that you want to be able to make use of a few times in different contexts, Categoricals are a great solution.
In later versions of pandas, sort has been replaced by sort_values, so you would instead need:
df.sort_values('LSE_cat')
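If you don't want the helper column at all, newer pandas (1.1 and up) also lets you pass a key function to sort_values. A small sketch, assuming the same ordering list:

order = ['Oands', 'Wetnds', 'Develd', 'Cn', 'Soys', 'Otherg', 'Wht']
# map each LSE value to its position in the list and sort by that position
df_sorted = df.sort_values('LSE', key=lambda s: s.map({v: i for i, v in enumerate(order)}))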