Consider the following dataframe. I have the columns 'product', 'buys', 'buy_again', and 'again_index'.
PS: again_index is buy_again / buys.
product buys buy_again again_index
a 3 2 0.667
b 40 10 0.25
c 2 1 0.5
d 420 70 0.166
e 87 21 0.241
f 28 4 0.142
Now the numbers for buys and buy_again are very skewed, and it is unfair to compare product d to product a based on the index alone. I want to normalize the data using pandas in such a way that the index can be used to directly compare one product to another, irrespective of it being new (e.g. one with just 3 buys) or old (e.g. one with 420 buys), so that the index can be my deciding factor for a product's performance.
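One common approach (my suggestion; the question doesn't specify a method) is additive smoothing: shrink each product's repeat rate toward the overall rate, with the amount of shrinkage controlled by a pseudo-count m, so low-volume products move strongly toward the global average while high-volume products barely move. A minimal sketch:
import pandas as pd

df = pd.DataFrame({
    'product': list('abcdef'),
    'buys': [3, 40, 2, 420, 87, 28],
    'buy_again': [2, 10, 1, 70, 21, 4],
})

# Overall repeat-buy rate across all products
global_rate = df['buy_again'].sum() / df['buys'].sum()

# m is a pseudo-count of "virtual buys" at the global rate; a tuning
# choice (20 here is arbitrary) -- larger m means stronger shrinkage
m = 20
df['smoothed_index'] = (df['buy_again'] + m * global_rate) / (df['buys'] + m)
With m = 20, product a's 0.667 (from only 3 buys) shrinks to about 0.25, while product d's 0.166 (from 420 buys) barely moves to about 0.168, so the two indexes become directly comparable.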
I have a dataframe in this format:
   ageClass sex nationality  treatment                                 unique_id  netTime  clockTime
0        20   M         KEN  Treatment  354658649da56c20c72b6689d2b7e1b8cc334ac9     7661       7661
1        20   M         KEN  Treatment  1da607e762ac07eba6f9b5a717e9ff196d987242     7737       7737
2        20   M         KEN    Control  1de4a95cef28c290ba5790217288f510afc3b26b     7747       7747
3        30   M         KEN    Control  12215d93d2cb5b0234991a64d097955338a73dd3     7750       7750
4        30   M         KEN  Treatment  5375986567be20b49067956e989884908fb807f6     8163       8163
5        20   M         ETH  Treatment  613be609b3f4a38834c2bc35bffbdb6c47418666     7811       7811
6        20   M         KEN    Control  70fb3284d112dc27a5cad7f705b38bc91f56ecad     7853       7853
7        30   M         AUS    Control  0ea5d606a83eb68c89da0a98543b815e383835e3     7902       7902
8        20   M         BRA    Control  ecdd57df778ad901b41e79dd2713a23cb8f13860     7923       7923
9        20   M         ETH    Control  ae7fe893268d86b7a1bdb4489f9a0798797c718c     7927       7927
The objective is to determine which age class benefitted most from being in the treatment group, as measured by clockTime.
That means I need to somehow group all values for members of each age group under both treatment and control conditions and take the average of their clockTimes.
Then, following that, I need to take the difference of the average clockTimes for the subgroups and compare all of these against one another.
Where I am stuck is with filtering the dataframe based on multiple columns simultaneously. I tried using groupby() as follows:
df.groupby(['ageClass','treatment'])['clockTime'].mean()
However I was not able to then calculate the difference in the mean times from the resulting series.
How should I move forward?
You can pivot the table of means you produced:
df2 = (df.groupby(['ageClass', 'treatment'])['clockTime'].mean()
         .reset_index()
         .pivot(index='treatment', columns='ageClass', values='clockTime'))
ageClass 20 30
treatment
Control 7862.500000 7826.0
Treatment 7736.333333 8163.0
Then it's easy to find a difference
df2['diff'] = df2[20] - df2[30]
treatment
Control 36.500000
Treatment -426.666667
Name: diff, dtype: float64
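Note that the diff above compares the two age classes within each arm. If you instead want the treatment effect within each age class (which is what the question asks for), a small variation on the same idea works (my sketch, reusing the grouped means):
df3 = df.groupby(['ageClass', 'treatment'])['clockTime'].mean().unstack('treatment')
df3['diff'] = df3['Treatment'] - df3['Control']
which gives -126.17 for age class 20 and 337.0 for age class 30.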
From the groupby you've already done, you can group by index level 0 (i.e. 'ageClass') and then use diff to find the difference between the averages of the treatment and control groups for each 'ageClass'. Since diff subtracts each value from the one after it (and "Control" sorts before "Treatment" alphabetically), the result is Treatment minus Control; appending "-Control" to the label makes that a bit clearer.
s = df.groupby(['ageClass','treatment'])['clockTime'].mean()
out = s.groupby(level=0).diff().dropna().reset_index()
out = out.assign(treatment=out['treatment']+'-Control')
Output:
ageClass treatment clockTime
0 20 Treatment-Control -126.166667
1 30 Treatment-Control 337.000000
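Since a lower clockTime is presumably better, the age class that benefitted most from treatment is the one with the most negative difference; for instance (using the out frame above):
best = out.loc[out['clockTime'].idxmin(), 'ageClass']   # 20 in this sample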
From your problem description, I would prescribe ranking. Differences between groups won't tell you who benefited the most.
s = df.groupby(['ageClass', 'treatment'])['clockTime'].agg('mean').reset_index()
s['rank'] = s.groupby('ageClass')['clockTime'].rank()
ageClass treatment clockTime rank
0 20 Control 7862.500000 2.0
1 20 Treatment 7736.333333 1.0
2 30 Control 7826.000000 1.0
3 30 Treatment 8163.000000 2.0
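Reading the ranks off (my addition): the rank-1 row within each age class is the faster arm, so for example
winners = s[s['rank'] == 1]   # Treatment is faster for 20, Control for 30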
Let's say I have a table with two columns, Date and Amount. The number of rows is not more than 3000.
Row Date Amount
1 15/05/2021 248
2 16/05/2021 115
3 17/05/2021 387
4 18/05/2021 214
5 19/05/2021 678
6 20/05/2021 489
7 21/05/2021 875
8 22/05/2021 123
................
I need to add a third column which will calculate the trim mean values based on the Amount column.
I will be using this function: my_table['TrimMean'] = stats.trim_mean(my_table['Amount'], 0.1), but adapted for my problem.
The problem is that this is not a fixed range but a dynamic one, following this logic: for each row in my table, the trim mean value will be calculated from the previous 90 values of the Amount column, starting from the row above the current row. If there are fewer than 90 values, then calculate with however many rows are available.
e.g. TrimMean[1000] = stats.trim_mean(values of Amount from rows 910 to 999); TrimMean[12] = stats.trim_mean(values of Amount from rows 1 to 11)
Hope that makes sense.
Is there any way I can calculate this in a simple way, without going through row by row iteration?
We can calculate the trim_mean by applying the function over a rolling window of size 90 with min_periods=1, then shifting the result down one row so each row only uses the values above it:
from scipy.stats import trim_mean
df['Amount'].rolling(90, min_periods=1).apply(trim_mean, args=(0.1, )).shift()
0 NaN
1 248.000000
2 181.500000
3 250.000000
4 241.000000
5 328.400000
6 355.166667
7 429.428571
Name: Amount, dtype: float64
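A possible refinement (my suggestion, not part of the original answer): raw=True makes rolling().apply() pass plain NumPy arrays to trim_mean instead of Series, which is usually faster, and the shifted result can be assigned straight to the new column:
df['TrimMean'] = (df['Amount']
                  .rolling(90, min_periods=1)
                  .apply(trim_mean, args=(0.1,), raw=True)
                  .shift())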
I have a data frame with columns:
User_id PQ_played PQ_offered
1 5 15
2 12 75
3 25 50
I need to divide PQ_played by PQ_offered to calculate the % of games played. This is what I've tried so far:
new_df['%_PQ_played'] = df.groupby('User_id').((df['PQ_played']/df['PQ_offered'])*100),as_index=True
I know that I am terribly wrong.
It's much simpler than you think.
df['%_PQ_played'] = df['PQ_played'] / df['PQ_offered'] * 100
PQ_offered PQ_played %_PQ_played
User_id
1 15 5 33.333333
2 75 12 16.000000
3 50 25 50.000000
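One caveat worth adding (an assumption about your data, not something stated in the question): if PQ_offered can ever be 0, the plain division produces inf; replacing zeros with NaN first avoids that:
import numpy as np

df['%_PQ_played'] = df['PQ_played'] / df['PQ_offered'].replace(0, np.nan) * 100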
You can use a lambda function:
df.groupby('User_id').apply(lambda x: (x['PQ_played'] / x['PQ_offered']) * 100)\
  .reset_index(level=1, drop=True).reset_index().rename(columns={0: '%_PQ_played'})
You get
User_id %_PQ_played
0 1 33.333333
1 2 16.000000
2 3 50.000000
I totally agree with @mVChr and think you are overcomplicating what you need to do. If you are simply trying to add an additional column, then his response is spot on. If you truly need groupby, it is worth noting that it is typically used for aggregation (e.g. sum(), count(), etc.). If, for example, you had several records with non-unique values in the User_id column, then you could create the additional column using
df['%_PQ_played'] = df['PQ_played'] / df['PQ_offered'] * 100
and then perform an aggregation. Let's say you wanted to know each user's average percentage of offered games actually played; you could do something like
new_df = df.groupby('User_id', as_index=False)['%_PQ_played'].mean()
This would yield (numbers are arbitrary)
User_id %_PQ_played
0 1 52.777778
1 2 29.250000
2 3 65.000000
I currently have a massive dataset with a large number of rows, and I wanted to create a smaller dataframe that pulls just 2 columns from the larger one, plus how many times each name occurred in each chapter (in this instance, 'Occurrence').
The below code is what I am using
df1 = (Dec16.groupby(["BNF Chapter", "Name"]).size().reset_index(name="Occurrence"))
df1
It outputs this:
BNF Chapter Name Occurrence
1 Aluminium hydroxide 2
1 Aluminium hydroxide + Magnesium trisilicate 2
1 Alverine 702
.......
21 Polihexanide 2
21 Potassium hydroxide 32
21 Sesame oil 22
21 Sodium chloride 222
What I would like to get is the top 10 most frequently occurring names for a certain chapter, as the dataset is so large.
For example, a dataframe that only pulls the top 10 most common names in chapter 1.
How would I go about doing this?
Many thanks!!!
You can use pandas.DataFrame.count; this guide on counting values in a pandas DataFrame can hopefully help you out.
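For the "top 10 in a chapter" part specifically, here is a minimal sketch building on the df1 already created above (nlargest and the sort-then-head pattern are standard pandas; the variable names are mine):
# Top 10 most common names in chapter 1
top10_ch1 = df1[df1['BNF Chapter'] == 1].nlargest(10, 'Occurrence')

# Or the top 10 for every chapter at once
top10_all = (df1.sort_values('Occurrence', ascending=False)
                .groupby('BNF Chapter')
                .head(10))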
DataFrame1:
Device MedDescription Quantity
RWCLD Acetaminophen (TYLENOL) 325 mg Tab 54
RWCLD Ampicillin Inj (AMPICILLIN) 2 g Each 13
RWCLD Betamethasone Inj *5mL* (CELESTONE SOLUSPAN) 30 mg (5 mL) Each 2
RWCLD Calcium Carbonate Chew (500mg) (TUMS) 200 mg Tab 17
RWCLD Carboprost Inj *1mL* (HEMABATE) 250 mcg (1 mL) Each 5
RWCLD Chlorhexidine Gluc Liq *UD* (PERIDEX/PERIOGARD) 0.12 % (15 mL) Each 5
DataFrame2:
Device DrwSubDrwPkt MedDescription BrandName MedID PISAlternateID CurrentQuantity Min Max StandardStock ActiveOrders DaysUnused
RWC-LD RWC-LD_MAIN Drw 1-Pkt 12 Mag/AlOH/Smc 200-200-20/5 *UD* (MYLANTA/MAALOX) (30 mL) Each MYLANTA/MAALOX A03518 27593 7 4 10 N Y 3
RWC-LD RWC-LD_MAIN Drw 1-Pkt 20 ceFAZolin in Dextrose(ISO-OS) (ANCEF/KEFZOL) 1 g (50 mL) Each ANCEF/KEFZOL A00984 17124 6 5 8 N N 2
RWC-LD RWC-LD_MAIN Drw 1-Pkt 22 Clindamycin Phosphate/D5W (CLEOCIN) 900 mg (50 mL) IV Premix CLEOCIN A02419 19050 7 6 8 N N 2
What I want to do is append DataFrame2 values to DataFrame1 ONLY if the 'MedDescription' matches. When it finds a match, I would like to add only certain columns from DataFrame2 [Min, Max, Days Unused], which are all integers.
I had an iterative solution where I access the DataFrame1 object one row at a time, check for a match in DataFrame2, and once found append the columns from there to the original DataFrame.
Is there a better way? It is slowing my computer to a crawl as I have thousands upon thousands of rows.
It sounds like you want to merge the target columns ('MedDescription', 'Min', 'Max', 'Days Unused') to df1 based on a matching 'MedDescription'.
I believe the best way to do this is as follows:
target_cols = ['MedDescription', 'Min', 'Max', 'Days Unused']
df1.merge(df2[target_cols], on='MedDescription', how='left')
how='left' ensures that all the data in df1 is returned, and only the target columns in df2 are appended if MedDescription matches.
Note: It is easier for others if you copy the results of df1/df2.to_dict(). The data above is difficult to parse.
This sounds like an opportunity to use Pandas' built-in functions for joining datasets - you should be able to join on MedDescription with the desired columns from DataFrame2. The join function in Pandas is very efficient and should far outperform your method of looping through.
Pandas has documentation on merging datasets that includes some good examples, and you can find ample literature on the concepts of joins in SQL tutorials.
pd.merge(ld,ldAc,on='MedDescription',how='outer')
This is the way I joined the two DataFrames; it seems to work, although it dropped one of the indexes that contained the devices.
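A likely explanation for the lost index (my note, not from the original answer): pd.merge ignores both frames' indexes unless you join on them, so the merged result gets a fresh RangeIndex. One workaround, assuming ld's index is the one you want to keep, is to reset it into a column before merging:
merged = (pd.merge(ld.reset_index(), ldAc, on='MedDescription', how='outer')
            .set_index('index'))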